A General Strategy for Physics-Based 
Model Validation 

Illustrated with Earthquake Phenomenology, 
Atmospheric Radiative Transfer, 
and Computational Fluid Dynamics 

Didier Sornette (1,2,3), Anthony B. Davis (4), James R. Kamm (5), and Kayo Ide (6)

(1) Institute of Geophysics and Planetary Physics
and Department of Earth & Space Sciences
University of California, Los Angeles, CA 90095, USA

(2) Laboratoire de Physique de la Matière Condensée (CNRS UMR 6622)
and Université de Nice-Sophia Antipolis
06108 Nice Cedex 2, France

(3) now at D-MTEC, ETH Zurich, CH-8032 Zurich, Switzerland
dsornette@ethz.ch

(4) Los Alamos National Laboratory
Space & Remote Sensing Group (ISR-2)
Los Alamos, NM 87545, USA
adavis@lanl.gov

(5) Los Alamos National Laboratory
Applied Science & Methods Development Group (X-1)
Los Alamos, NM 87545, USA
kammj@lanl.gov

(6) Institute of Geophysics and Planetary Physics
and Department of Atmospheric & Oceanic Sciences
University of California, Los Angeles, CA 90095, USA
kayo@atmos.ucla.edu



This article is to be published in the Lecture Notes in Computational Science
and Engineering (Vol. TBD), Proceedings of Computational Methods in
Transport, Granlibakken 2006, F. Graziani and D. Swesty (Eds.), Springer-
Verlag, New York (NY), 2007.




This article is an augmented version of Ref. [1] by Sornette et al. that
appeared in Proceedings of the National Academy of Sciences in 2007
(doi: 10.1073/pnas.0611677104), with an electronic supplement at URL
http://www.pnas.org/cgi/content/full/0611677104/DC1.
Ref. [1] is also available in preprint form at URL
http://arxiv.org/abs/physics/0511219



Summary. Validation is often defined as the process of determining the degree to 
which a model is an accurate representation of the real world from the perspective 
of its intended uses. Validation is crucial as industries and governments depend 
increasingly on predictions by computer models to justify their decisions. In this 
article, we survey the model validation literature and propose to formulate validation 
as an iterative construction process that mimics the process occurring implicitly in 
the minds of scientists. We thus offer a formal representation of the progressive 
build-up of trust in the model, and thereby replace incapacitating claims on the 
impossibility of validating a given model by an adaptive process of constructive 
approximation. This approach is better adapted to the fuzzy, coarse-grained nature 
of validation. Our procedure factors in the degree of redundancy versus novelty of 
the experiments used for validation as well as the degree to which the model predicts 
the observations. We illustrate the new methodology first with the maturation of 
Quantum Mechanics as the arguably best established physics theory and then with 
several concrete examples drawn from some of our primary scientific interests: a 
cellular automaton model for earthquakes, an anomalous diffusion model for solar 
radiation transport in the cloudy atmosphere, and a computational fluid dynamics
code for the Richtmyer-Meshkov instability.

1 Introduction: Our Position with Respect to Previous
Work on Validation and Related Concepts

1.1 Introductory Remarks and Outline 

At the heart of the scientific endeavor, model building involves a slow and arduous 
selection process, which can be roughly represented as proceeding according to the 
following steps: 

1. start from observations and/or experiments; 

2. classify them according to regularities that they may exhibit: the presence of 
patterns, of some order, also sometimes referred to as structures or symmetries, 
is begging for "explanations" and is thus the nucleation point of modeling; 

3. use inductive reasoning, intuition, analogies, and so on, to build hypotheses from
which a model is constructed;

4. test the model obtained in step 3 with available observations, and then extract 
predictions that are tested against new observations or by developing dedicated 
experiments. 

The model is then rejected or refined by an iterative process, a loop going from step 1 
to step 4. A given model is progressively validated by the accumulated confirmations 
of its predictions by repeated experimental and/or observational tests. 

Building and using a model requires a language, i.e., a vocabulary and syntax, to 
express it. The language can be English or French, for instance, to obtain predicates
specifying the properties of and/or relations with the subject(s). It can be
mathematics, which is arguably the best language to formalize the relation between quantities,
structures, space and change. It can be a computer language to implement a set of
relations and instructions logically linked in a computer code to obtain quantitative
outputs in the form of strings of numbers. In this latter version, our primary interest
here, validation must be distinguished from verification. Whereas verification deals 
with whether the simulation code correctly solves the model equations, validation 
carries an additional degree of trust in the value of the model vis-a-vis experiment 
and, therefore, may convince one to use its predictions to explore beyond known 
territories [2]. 

The validation of models is becoming a major issue as humans are increasingly 
faced with decisions involving complex tradeoffs in problems with large uncertainties, 
as for instance in attempts to control the growing anthropogenic burden on the 
planet within a risk-cost framework [HH] based on predictions of models. For policy 
decisions, national, regional, and local governments increasingly depend on computer 
models that are scrutinized by scientific agencies to attest to their legitimacy and 
reliability. Cognizance of this trend and its scientific implications is not lost on the 
engineering and physics [6] communities. 

Our purpose here is to clarify from a physics-based perspective what validation 
is and to propose a roadmap for the development of a systematic approach to physics-
based validation with broad applications. We will focus primarily on the needs of
computational fluid dynamics and particle/radiation transport codes. 

^ By model, we understand an abstract conceptual construction based on axioms
and logical relations developed to extract logical propositions and predictions.

In the remainder of this section, we first review different definitions and
approaches found in the literature, positioning ourselves with respect to selected topics
or practices pertaining to validation; we then show how the validation problem is 
related to the mathematical statistics of hypothesis testing and discuss some prob- 
lems associated with emergent behaviors in complex systems. In section 2, we list 
and describe qualitatively the elements required in our vision of model validation 
as an iterative process where one strives to build trust in the model going from one 
experiment to the next; however, one must also be prepared to uncover in the model 
a flaw, which may or may not be fatal. We offer in sections 3-4 our quantitative 
physics-based approach to model validation, where the relevance of the experiment 
to the validation process is represented explicitly. (An appendix explores the model 
validation problem more formally and in a broader context.) Section 5 demonstrates 
the general strategy for model validation using the historical development of quan- 
tum physics — a remarkably clear ideal case. Section 6 uses some research interests 
of the present authors to further illustrate the validation procedure using less-than- 
perfect models in geophysics, computational fluid dynamics (CFD), and radiative 
transfer. We summarize in section 7. 

1.2 Standardized Definitions 

The following definitions are given by the American Institute of Aeronautics and
Astronautics [7]:

• Model: A representation of a physical system or process intended to enhance our 
ability to predict, control and eventually to understand its behavior. 

• Calibration: The process of adjusting numerical or physical modeling parame- 
ters in the computational model for the purpose of improving agreement with 
experimental data. 

• Verification: The process of determining that a model implementation accurately 
represents the developer's conceptual description of the model and the solution 
of the model. 

• Validation: The process of determining the degree to which a model is an accurate 
representation of the real world from the perspective of the intended uses of the 
model. 

Figure 1, sometimes called a Sargent diagram, shows where validation and several
others of the above constructs and stages enter into a complete modeling project.

In the concise phrasing of Roache [2], "Verification consists in solving the
equations right while validation is solving the right equations." In the context of the
validation of astrophysical simulation codes, Calder et al. [11] add: "Verification
and validation are fundamental steps in developing any new technology. For
simulation technology, the goal of these testing steps is assessing the credibility of modeling
and simulation."

Verifications of complex CFD codes usually comprise a suite of standard test
problems in the field of fluid dynamics [11]. These include Sod's test [12], the strong
shock tube problem [13], the Sedov explosion problem [14], the interacting blast
wave problem [15], a shock forced through a jump in mesh refinement, and so on.

Validation of complex CFD codes is usually done by comparison with
experiments testing a variety of physical phenomena, including instabilities, turbulent
mixing, shocks, etc. Validation requires that the numerical simulations recover the
salient qualitative features of the experiments, such as the instabilities, their
nonlinear development, the determination of the most unstable modes, and so on. See,
for instance, Gnoffo et al. [16].

Fig. 1. Schematic representation of the conventional position of validation in model
construction according to Schlesinger [8] and Sargent [9, 10].

Considerable work on verification and validation of simulations has been done in
the field of CFD, and in this literature the terms verification and validation have
precise, technical meanings [7, 2, 17, 9, 10]. Verification is taken to mean demonstrating
that a code or simulation accurately represents the conceptual model. Roache [18]
stresses the importance of distinguishing between (i) verification of codes and (ii)
verification of calculations. The former is concerned with the correctness of the code.
The latter deals with the correctness of the physical equations used in the code. The
programming and methods of solution can be correct (verification (i) successful)
but they can solve erroneous equations (verification (ii) failure). Validation of a
simulation means demonstrating that the simulation appropriately describes Nature.
The scope of validation is therefore much larger than that of verification and in- 
cludes comparison of numerical results with experimental or observational data. In 
astrophysics, where it is difficult to obtain observations suitable for comparison to 
numerical simulations, this process can present unique challenges. Roache [op. cit.] 
goes on to offer the optimistic prognosis that "the problems of Verification of Codes 
and Verification of Calculations are essentially solved for the case of structured grids, 
and for structured refinement of unstructured grids. It would appear that one higher 
level of algorithm/ code development is required in order to claim a complete method- 
ology for Verification of Codes and Calculations. I expect this to happen. Within 
10 years, and likely much less. Verification of Codes and Calculations ought to be 
settled questions. I expect that Validation questions will always be with us." We fully
endorse this last sentence, as we will argue further on that validation is akin to the 
development of "trust" in theories of real phenomena, a never-ending quest. 

1.3 Impossibility Statements 

For these reasons, the possibility of validating numerical models of natural
phenomena, often either endorsed implicitly or identified as a reachable goal by natural
scientists in their daily work, has been challenged; quoting from Oreskes et al. [19]:
"Verification and validation of numerical models of natural systems is impossible.
This is because natural systems are never closed and because model results are always
non-unique." According to this view, the impossibility of "verifying" or "validating"
models is not limited to computer models and codes but extends to all theories that rely
necessarily on imperfectly measured data and auxiliary hypotheses. As Sterman [20]
puts it: "Any theory is underdetermined and thus unverifiable, whether it is embodied
in a large-scale computer model or consists of the simplest equations." Accordingly,
many uncertainties undermine the predictive reliability of any model of a complex 
natural system in advance of its actual use. 

Such "impossibility" statements are reminiscent of other "impossibility
theorems." Consider the mathematics of algorithmic complexity, which provides
one approach to the study of complex systems. Following reasoning related to that
underpinning Gödel's incompleteness theorem, most complex systems have been
proved to be computationally irreducible, i.e., the only way to predict their evolu- 
tion is to actually let them evolve in time. Accordingly, the future time evolution of 
most complex systems appears inherently unpredictable. Such sweeping statements 
turn out to have basically no practical value. This is because, in physics and other 
related sciences, one aims at predicting coarse-grained properties. Only by ignor- 
ing most of molecular detail, for example, did researchers ever develop the laws of 
thermodynamics, fluid dynamics and chemistry. Physics works and is not hampered 
by computational irreducibility because we only ask for approximate answers at 
some coarse-grained level [26]. By developing exact but coarse-grained procedures 
on computationally irreducible cellular automata, Israeli and Goldenfeld [27] have 
demonstrated that prediction may simply depend on finding the right level for de- 
scribing the system. More generally, we argue that only coarse-grained scales are 
of interest in practice but their description requires "effective" laws which are in 
general based on finer scales. In other words, real understanding must be rooted 
in the ability to predict coarser scales from finer scales, i.e., a real understanding 
solves the universal micro-macro challenge. Similarly, we propose that validation is 
possible, to some degree, as explained further on. 

^ For further debate and commentary by Oreskes and her co-authors, see refs.
[21, 22, 23]; also noteworthy is the earlier paper by Konikow and Bredehoeft
[24] for a statement about validation impossibility in the context of groundwater
models.

1.4 Validation and the Mathematical Statistics of Hypothesis
Testing

Calder et al. also write: "We note that verification and validation are necessary
but not sufficient tests for determining whether a code is working properly or a
modeling effort is successful. These tests can only determine for certain that a code
is not working properly." This last statement is important because it points to a
bridge between the problem of validation and some of the most central questions of
mathematical statistics [21], namely, hypothesis testing and statistical significance
tests. This connection has been made previously by several other authors
[29, 30, 31, 32]. In showing the usefulness of the concepts and framework of hypothesis testing,
we depart from Oberkampf and Trucano [38] who mistakenly state that hypothesis
testing is a true or false issue, only. Every test of significance begins with a "null"
hypothesis H0, which represents a theory that has been put forward, either because
it is believed to be true or because it is to be used, but has not been proved.

For example, in a clinical trial of a new drug, the null hypothesis might be: "the
new drug is no better, on average, than the current drug." We would write H0: "there
is no difference between the two drugs on average." The alternative hypothesis H1 is
a statement of what a statistical hypothesis test is set up to establish. In the example
of a clinical trial of a new drug, the alternative hypothesis might be that the new
drug has a different effect, on average, compared to that of the current drug.
We would write H1: the two drugs have different effects, on average. The alternative
hypothesis might also be that the new drug is better, on average, than the current
drug. Once the test has been carried out, the final conclusion is always given in
terms of the null hypothesis. We either "reject H0 in favor of H1" or "do not reject
H0." We never conclude "reject H1," or even "accept H1." If we conclude "do not
reject H0," this does not necessarily mean that the null hypothesis is true; it only
suggests that there is not sufficient evidence against H0 in favor of H1. Rejecting the
null hypothesis, in turn, suggests that the alternative hypothesis may be true, or is at
least better supported by the data. Thus, one can never prove that an hypothesis
is true, only that it is wrong by comparing it with another hypothesis. One can
also conclude that "hypothesis H1 is not necessary and another, more parsimonious,
one H0 should be favored." The alternative hypothesis H1 is not rejected, strictly
speaking, but is found unnecessary or redundant with respect to H0. This is the
situation when there are two (or several) alternative hypotheses H0 and H1, which
can be composite, nested, or non-nested.

Within this framework, the above-mentioned statement by Oreskes et al. [19]
that verification and validation of numerical models of natural systems is impossible 
is hardly news: the theory of statistical hypothesis testing has taught mathematical 
and applied statisticians for decades that one can never prove an hypothesis or a 
model to be true. One can only develop an increasing trust in it by subjecting it to 
more and more tests that "do not reject it." We attempt to formalize below how 
such trust can be increased to lead to an asymptotic validation. 
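
The clinical-trial example above can be made concrete with a small permutation test. The sketch below is illustrative only: the data values, the function names `permutation_p_value` and `conclude`, and the 5% significance level are assumptions made for this example and do not come from the text. Note that the only two possible conclusions are "reject H0 in favor of H1" and "do not reject H0."

```python
import random
import statistics


def permutation_p_value(sample_a, sample_b, n_resamples=2000, seed=0):
    """Two-sided permutation test of H0: both samples have the same mean."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(sample_a) - statistics.mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction: the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


def conclude(p_value, alpha=0.05):
    """The conclusion is always phrased in terms of the null hypothesis."""
    if p_value < alpha:
        return "reject H0 in favor of H1"
    return "do not reject H0"


# Hypothetical outcome scores, invented for illustration.
current_drug = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.9]
new_drug_similar = [5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 5.3, 5.1]
new_drug_better = [6.1, 5.9, 6.3, 6.0, 5.8, 6.2, 6.1, 5.9]

p_similar = permutation_p_value(current_drug, new_drug_similar)
p_better = permutation_p_value(current_drug, new_drug_better)
print("similar drug:", conclude(p_similar))
print("better drug: ", conclude(p_better))
```

A "do not reject H0" outcome for the first comparison does not show that the two drugs are identical; it only says that these data give no grounds to prefer H1.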



^ We refer the reader to V.J. Easton and J.H. McColl, Statistics Glossary,
http://www.cas.lancs.ac.uk/glossary_v1.1/main.html from which we have
borrowed liberally for this brief summary.

^ The technical difficulties of hypothesis testing depend on these nested structures
of the competing hypotheses; see, for instance, Gourieroux and Monfort [34].

1.5 Code Comparison

The above definitions are useful in recasting the role of code comparison in
verification and validation (Code Comparison Principle or CCP). Trucano et al. [35] are
unequivocal on this practice: "the use of code comparisons for validation is improper
and dangerous." We propose to interpret the meaning of CCP for code verification
activities (which has been proposed in this literature) as parallel to the problem of
hypothesis testing: Can one reject Code #1 in favor of Code #2? In this spirit, the
CCP is nothing but a reformulation in the present context of the fundamental
principle of hypothesis testing. Viewed in this way, it is clear why CCP is not sufficient
for validation since validation requires comparison with experiments and several
other steps described below. The analogy with hypothesis testing illuminates what
CCP actually is: CCP allows the selection of one code among several codes (at least
two) but does not help one to draw conclusions about the validity of a given code
or model when considered as a unique entity independent of other codes or models.
Thus, the fundamental problem of validation is more closely associated with the
other class of problems addressed by the theory of hypothesis testing, which consists
in the so-called "tests of significance" where one considers only a single hypothesis
H0, and the alternative is "all the rest," i.e., all hypotheses that differ from H0. In
that case, the conclusion of a test can be the following: "this data sample does not
contradict the hypothesis H0," which is not the same as "the hypothesis H0 is true."
In other words, an hypothesis cannot be excluded when it is found sufficient, at
some confidence level, to explain the available data. This is not to say that the
hypothesis is true. It is just that the available data are unable to reject said
hypothesis. Restating the same thing in a positive way, the result of a test of significance
is that the hypothesis H0 is "compatible with the available data."
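
A test of significance against a single hypothesis can be sketched with an exact binomial test. Everything specific here is hypothetical: a model is imagined to predict that some event occurs with probability p0 = 0.5 in each of 20 independent trials, and the observed counts, the function names, and the 5% level are made up for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_p_value(k_obs, n, p0):
    """Two-sided p-value under H0: total probability of all outcomes
    that are no more likely than the observed one."""
    p_obs = binomial_pmf(k_obs, n, p0)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        binomial_pmf(k, n, p0)
        for k in range(n + 1)
        if binomial_pmf(k, n, p0) <= p_obs * tolerance
    )


def verdict(p_value, alpha=0.05):
    """The test can contradict H0, but it can never prove H0 true."""
    if p_value < alpha:
        return "data contradict H0"
    return "H0 is compatible with the available data"


n_trials, p0 = 20, 0.5
for observed_events in (12, 18):
    p_value = exact_binomial_p_value(observed_events, n_trials, p0)
    print(observed_events, "events:", verdict(p_value))
```

With 12 events out of 20 the hypothesis survives, although many other values of p0 would survive as well; this is precisely why "compatible with the data" is weaker than "true."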

It is implicit in the above discussion that, to compare codes quantitatively in a
meaningful way, they must solve the same set of equations using different algorithms,
and not just model the same physical system. Indeed, there is nothing wrong with
"validating" a numerical implementation of a knowingly approximate approach to a
given physical problem. For instance, a (duly verified) diffusion/P1 transport code
can be validated against a detailed Monte Carlo or SN code. The more detailed model
must in principle be validated against real-world data. In turn, it provides validation
"data" to the coarser model. Naturally, the coarser (say, P1 transport) model still
needs to establish its relevance to the real world problem of interest, preferably by
comparison with real observations, or at least be invoked only in regimes where it
is known a priori to be sufficiently accurate based on comparison with a finer (say,
Monte Carlo transport) model.
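
As a minimal sketch of what comparing two codes that solve the same equation with different algorithms can look like, consider the toy problem dy/dt = -y with y(0) = 1. The equation, the two integrators, and the names `euler_code` and `rk4_code` are assumptions chosen for brevity; they stand in for real transport or CFD codes only by analogy.

```python
import math


def decay_rate(t, y):
    """Right-hand side of the shared model equation dy/dt = -y."""
    return -y


def euler_code(y0, t_end, n_steps):
    """'Code #1': first-order explicit Euler integration."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y += h * decay_rate(t, y)
        t += h
    return y


def rk4_code(y0, t_end, n_steps):
    """'Code #2': classical fourth-order Runge-Kutta integration."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        k1 = decay_rate(t, y)
        k2 = decay_rate(t + 0.5 * h, y + 0.5 * h * k1)
        k3 = decay_rate(t + 0.5 * h, y + 0.5 * h * k2)
        k4 = decay_rate(t + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return y


exact = math.exp(-1.0)  # analytic solution at t = 1
coarse = euler_code(1.0, 1.0, 100)
fine = rk4_code(1.0, 1.0, 100)
code_discrepancy = abs(coarse - fine)

print("Euler error:      ", abs(coarse - exact))
print("RK4 error:        ", abs(fine - exact))
print("code discrepancy: ", code_discrepancy)
```

Here the discrepancy between the two codes is almost entirely the error of the coarser one. Agreement of this kind bears on whether the equation is solved correctly; it says nothing about whether dy/dt = -y describes any real system, which only comparison with observations can address.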

Two noteworthy initiatives in transport model comparison for non-nuclear
applications are the Intercomparison of 3D Radiation Codes (I3RC) [36] (i3rc.gsfc.nasa.gov)
and the RAdiation Model Intercomparison (RAMI) [37, 38] (rami-benchmark.jrc.it).
The former is focused on the challenge of 3D radiative transfer in the cloudy
atmosphere while the latter is about 3D radiative transfer inside plant canopies; both
efforts are motivated by issues in remote sensing (especially from space) and
radiative energy budget estimation (either in the framework of climate modeling or using
observational diagnostics, which typically means more remote sensing). Much has
been learned by the modelers participating in these code comparison studies, and
the models have been improved on average [39]. Although not connected so far to
the engineering community that is at the forefront of V&V standardization and
methodology, the I3RC and RAMI communities talk much about "testing," and
sometimes "certification," and not so much about "verification" (which would be
appropriate) or "validation" (which would not).

* We should stress that the Sandia Report [35] by Trucano et al. presents an even
more negative view of code comparisons because it addresses the common practice
in the simulation community that turns to code comparisons rather than bona
fide verification or validation, without any independent referents.

^ In remote sensing science, transport theory (for photons) plays a central role
and "validation" has a special meaning, namely, the estimation of uncertainty
for remote sensing products based on "ground-truth," i.e., field measurements
of the very same geophysical variables (e.g., surface temperature or reflectivity,
vegetation productivity, soil moisture) that the satellite instrument is designed to
quantify. These data are collected at the same location as the imagery, if possible,
at the precision of a single pixel. This type of validation exercise will test both
the "forward" radiation transport theory and its "inversion." Atmospheric remote
sensing, particularly of clouds, poses a special challenge because, strictly speaking,
there is no counterpart of ground-truthing. One must therefore often make do
with comparisons of ground-based and space-based remote sensing (say, of the
column-integrated aerosol burden) to quantify uncertainty in both operations.
In-situ measurements (temperature, humidity, cloud liquid water, etc.) from airborne
platforms (balloon or aircraft) are always welcome but collocation is rarely close
enough for point-to-point comparisons; statistical agreement is then all that is to
be expected, and residuals provide the required uncertainty.

What about multi-physics codes such as those used routinely in astrophysics,
nuclear engineering, or climate modeling? CCP, along with the stern warnings of
Trucano et al. [35], applies here, too. Even assuming that all the model
components are properly verified or even individually validated, the aggregated model is
likely to be too complex to talk about clean verification through output comparison.
Finding some level of agreement between two or more complex multi-physics models
will naturally build confidence in the whole (community-wide) modeling enterprise.
However, this is not to be interpreted as validation of any or all of the individual
models.

There are many reasons for wanting to have not just one model on hand but a
suite of more or less elaborate ones. A typical collection can range from the
mathematically and physically exact but numerically intractable to the analytically
solvable, possibly even on the proverbial back-of-an-envelope. We elaborate on and
illustrate this kind of hierarchical modeling effort in section A.2 of the Appendix,
offering it as an approach where model development is basically simultaneous with
its validation.

1.6 Relations Between Validation, Calibration and Data
Assimilation

As previously stated, validation can be characterized as the act of quantifying the
credibility of a model to represent phenomena of interest. Virtually all such models
contain numerical parameters, the precise values of which are not known a priori and,
therefore, must be assigned. Calibration is the process of adjusting those parameters
to optimize (in some sense) the agreement between the model results and a specific
set of experimental data. Such data necessarily have uncertainties associated with
them, e.g., due to natural variability in physical phenomena as well as to unavoidable
imprecision of diagnostics. Likewise, there are intrinsic errors associated with the
numerical methods used to evaluate many models, e.g., in the approximate solutions
obtained from discretization schemes applied to partial differential equations. The
approach of defensibly prescribing parameters for complex physical phenomena while
incorporating the inescapable variability in these values is called "calibration under
uncertainty" [40], a field that poses non-trivial challenges in its own right.

However calibration is approached, it must be undertaken using a set of data
(ideally from specifically chosen calibration experiments/observations [41]) that
differs from the physical configurations of ultimate interest (i.e., against which the
model will be validated). In order to ensure that validation remains independent of
calibration, it is imperative that these data sets be disjoint. In the case of large,
complex, and costly experiments encountered in many real-world applications, it
can be difficult to maintain a scientific "demilitarized zone" between calibration
and validation. To not do so, however, risks undermining the scientific integrity of
the associated modeling enterprise, the potential predictive power of which may
rapidly wither as the validation study devolves into a thinly disguised exercise in
calibration.
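
The requirement that calibration and validation data be disjoint can be illustrated with a toy one-parameter model. The sketch below is hypothetical throughout: the synthetic "experiment" (y = 2x plus Gaussian noise), the model form y = a x, and the function names are assumptions made for this example.

```python
import math
import random


def calibrate_slope(points):
    """Least-squares estimate of a in the model y = a * x."""
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    return sum_xy / sum_xx


def rms_misfit(slope, points):
    """Root-mean-square difference between model and data."""
    squared = [(y - slope * x) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic "measurements" with noise of standard deviation 0.1.
rng = random.Random(42)
data = [(0.5 * i, 2.0 * (0.5 * i) + rng.gauss(0.0, 0.1)) for i in range(1, 21)]

# Split once, at random, into two sets that share no data point.
rng.shuffle(data)
calibration_set, validation_set = data[:10], data[10:]
assert not set(calibration_set) & set(validation_set)

slope = calibrate_slope(calibration_set)            # parameter adjustment
validation_rms = rms_misfit(slope, validation_set)  # independent check

print("calibrated slope:", slope)
print("misfit on held-out validation data:", validation_rms)
```

Had the misfit been evaluated on `calibration_set` instead, a small value would say little, since the parameter was tuned to those very points.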

For complex systems, there are many choices to be made regarding experimental 
and numerical studies in both validation and calibration. The high-level approach 
of the Phenomena Identification and Ranking Table (PIRT) [42] can be used to 
heuristically characterize the nature of one's interest in complicated systems. This 
approach uses expert knowledge to identify the phenomenological components in 
a system of interest, to rank their (relative) perceived importance in the overall 
system, and to gauge the (relative) degree to which these component phenomena 
are perceived to be understood. This rough-and-ready approach can be used to 
target the choice of validation experiments for the greatest scientific payoff on fixed 
experimental and simulation budgets. To help guide calibration activities, one can 
apply the quantitative techniques of sensitivity analysis to rank the relative impact of 
input parameters on model outcome. Such considerations are particularly important 
for complex models containing many adjustable parameters, for which it may prove 
impossible to faithfully calibrate all input parameters. 

Saltelli et al. [43, 44] have championed "sensitivity analysis" methods, which
come in two basic flavors and many variations. One class of methods uses exact or 
numerical evaluation of partial derivatives of model output deemed important with 
respect to input parameters to seek regions of parameter space that might need 
closer examination from the standpoints of calibration and/or validation. If the 
model has time dependence, one can follow the evolution of how parameter choices 
influence the outcome. The alternate methodology uses adjoint dynamical equations 
to determine the relative importance of various parameters. The publications of 
Saltelli et al. provide numerous examples illustrating the value and practical impact 
of sensitivity analysis, as well as references to the wide scientific literature on this 
subject. The results of numerical studies guided by sensitivity analysis can be used 
both to focus experimental resources on high-impact experimental studies and to 
steer future model development efforts. 
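
The derivative-based flavor of sensitivity analysis can be sketched with central finite differences. The toy model, its parameter names and values, and the choice of normalized (dimensionless) sensitivities are assumptions for illustration; they are not taken from the cited works.

```python
import math


def toy_model(params):
    """Hypothetical model output: a decaying signal evaluated at t = 2."""
    return params["amplitude"] * math.exp(-2.0 * params["rate"]) + params["offset"]


def normalized_sensitivities(model, params, rel_step=1e-4):
    """Estimate (p / f) * df/dp for each parameter p by central differences."""
    base = model(params)
    result = {}
    for name, value in params.items():
        step = rel_step * abs(value) if value != 0 else rel_step
        upper = dict(params)
        upper[name] = value + step
        lower = dict(params)
        lower[name] = value - step
        derivative = (model(upper) - model(lower)) / (2 * step)
        result[name] = derivative * value / base
    return result


params = {"amplitude": 3.0, "rate": 1.0, "offset": 0.1}
sens = normalized_sensitivities(toy_model, params)

# Rank parameters by the magnitude of their influence on the output.
ranking = sorted(sens, key=lambda name: abs(sens[name]), reverse=True)
for name in ranking:
    print(f"{name:10s} {sens[name]:+.3f}")
```

In this toy case the output is most sensitive to `rate`, so that parameter would deserve the most attention in calibration. Such a ranking is local: it holds near the chosen parameter values and may change elsewhere in parameter space.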

In dynamical modeling, initial conditions can be viewed as parameters and, as 
such, they need to be determined optimally from data. If the dynamical system in 
question is evolving continuously over time and data become available along the 
trajectory of the dynamical system, the problem of finding a single initial condition 
over the entire trajectory becomes increasingly and exceedingly difficult as the time 
window of the trajectory extends. In fact, it is practically impossible for systems
like the atmosphere or ocean, whose dynamics is highly nonlinear, whose high-dimensional
model is undoubtedly imperfect, and whose inhomogeneous and sporadic data are subject
to (poorly understood) errors.

Data assimilation is an approach that attends to this problem by breaking up the
trajectory over (fixed-length) time windows and solving the initialization problem
sequentially over one time window at a time as data become available. A novelty
of data assimilation is that, rather than solving the initialization problem from
scratch, it uses the model forecast as the first guess (the prior) of the initialization
(optimization) problem. Once the optimization is completed, the optimal solution
(the posterior) becomes the initial condition for the next model forecast.

This iterative Bayesian approach to data assimilation is most effective when the 
uncertainties in both the prior and the data are accurately quantified, as the system 
evolves over time and the data assimilation iterates one cycle after another. This is 
a non-trivial problem, because it requires the estimate of not only the model state 
but also the uncertainties associated with it, as well as the proper description of the 
uncertainties in data. 
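The forecast/analysis cycle described above can be sketched for a single scalar variable. This is only a minimal illustration under strong assumptions (a linear toy model, Gaussian errors, a scalar Kalman update); the function names `forecast`, `analysis`, and `assimilate`, and all numerical values, are hypothetical and are not taken from any operational system.

```python
import random


def forecast(x, var, a=0.95, model_err_var=0.02):
    """Propagate the state estimate and its variance over one time window.

    Assumed toy model: x -> a * x, with additive model-error variance.
    """
    return a * x, a * a * var + model_err_var


def analysis(x_prior, var_prior, y_obs, obs_var):
    """Bayesian update of a scalar Gaussian prior with one Gaussian observation."""
    gain = var_prior / (var_prior + obs_var)      # weight given to the data
    x_post = x_prior + gain * (y_obs - x_prior)   # posterior mean
    var_post = (1.0 - gain) * var_prior           # posterior variance
    return x_post, var_post


def assimilate(x0, var0, observations, obs_var):
    """Cycle forecast (prior) and analysis (posterior), one window per observation."""
    x, var = x0, var0
    history = []
    for y in observations:
        x, var = forecast(x, var)                 # model forecast is the first guess
        x, var = analysis(x, var, y, obs_var)     # posterior initializes next forecast
        history.append((x, var))
    return history


if __name__ == "__main__":
    rng = random.Random(0)
    truth, obs = 1.0, []
    for _ in range(5):
        truth *= 0.9                              # "true" dynamics differ from the model
        obs.append(truth + rng.gauss(0.0, 0.1))
    for x, var in assimilate(2.0, 1.0, obs, obs_var=0.01):
        print(f"estimate={x:.3f}  variance={var:.4f}")
```

Note that the update needs both the prior variance and the observation variance; if either is misquantified, the gain is wrong at every cycle, which is the difficulty the paragraph above points to.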

Numerical weather prediction (NWP) is one of the most familiar application ar- 
eas of data assimilation — one with major societal impact. The considerable progress 
in NWP skill in recent decades has been due to improvements in all aspects
of data assimilation [45], i.e., modeling of the atmosphere, quality and quantity of
data, and data assimilation methods. At the time of writing, most operational NWP
centers use the so-called "three-dimensional variational method" (3D-Var) [46],
which is an economical and accurate statistical interpolation scheme that does not 
include the effect of uncertainty in the forecast. Some centers have switched to the
"four-dimensional variational method" (4D-Var) [47], which incorporates the evolution
of uncertainty in a linear sense by the use of the adjoint model of the highly
nonlinear model. These variational methods always call for the minimization of a 
cost function (cf. Appendix) that measures the difference between model results and 
observations throughout some relevant region of space and time. Currently active 
research areas in data assimilation include the effective and efficient quantification 
of the time-dependent uncertainties of both the prior and posterior in the analysis.
To this end, the ensemble Kalman filter methods have recently received considerable
attention motivated by future integration into operational environments [48][49][50].
As the importance of the uncertainties in data assimilation has become clear, many
NWP centers perform ensemble prediction along with the single analysis obtained
by the variational methods [51][52][53].
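For orientation, the cost function minimized in 3D-Var is commonly written in the following textbook form. The notation here is an assumption of this sketch and may differ from that of the Appendix: $\mathbf{x}_b$ is the background (forecast) state, $\mathbf{B}$ and $\mathbf{R}$ are the background and observation error covariance matrices, $\mathbf{y}$ is the observation vector, and $H$ is the observation operator.

```latex
J(\mathbf{x}) =
  \tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}_b)^{\mathsf T}\,\mathbf{B}^{-1}\,(\mathbf{x}-\mathbf{x}_b)
  + \tfrac{1}{2}\,\bigl(\mathbf{y}-H(\mathbf{x})\bigr)^{\mathsf T}\,\mathbf{R}^{-1}\,\bigl(\mathbf{y}-H(\mathbf{x})\bigr)
```

The first term penalizes departure from the prior and the second penalizes misfit to the data, each weighted by the inverse of its assumed error covariance.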

Clearly, considerable similarities exist between the data assimilation problem 
and the model validation problem. Can successful data assimilation be construed 
as validation of the model? In our opinion, that would be unjustified because the 
objectives are clearly different for these problems. As stated above, data assimilation 
admits the imperfection of the model. It explicitly makes use of the knowledge from 
the previous data assimilation cycle. As the initialization problem is solved iteratively
over relatively short time windows, deviations of the model trajectory from the
true evolution of the dynamical system in question tend to be small and data can
be assimilated into the model without much discrepancy. Moreover, the operational
centers perform careful quality-control of data to eliminate any isolated "outliers" 
with respect to the model trajectory. Thus, the data assimilation problem differs 
from the validation problem by design. Nevertheless, it is important to recognize 
that the resources offered by data assimilation can ensure that models perform well 
enough for their intended use. 

1.7 Extension of the Meaning of Validation 

A qualitatively new class of problems arises in fields such as the geosciences that
deal with the construction of knowledge of a unique object, planet Earth, whose full
scope and range of processes can be replicated or controlled neither in the laboratory
nor in a supercomputer. This has led recently to championing the relevance of
"systemic" (meaning "system approach"), also called "complex system," approaches
to the geosciences. In this framework, positive and negative feedbacks (and even 
more complicated nonlinear multiplicative noise processes) entangle many different 
mechanisms, whose impact on the overall organization can be neither assessed nor 
understood in isolation. How does one validate a model using the systemic approach? 
This very interesting and difficult question is at the core of the problem of validation. 

How does one validate a model when it is making predictions on objects that are 
not fully replicated in the laboratory, either in the range of variables, of parameters, 
or of scales? For instance, this question is crucial 

• in scaling the physics of material and rock rupture tested in the laboratory
to the scale of earthquakes;

• in scaling the knowledge of hydrodynamical processes quantified in the laboratory
to the length and time scales relevant to the atmospheric/oceanic weather
and climate, not to mention astrophysical systems;

• in the science-based stewardship of the nuclear arsenal, where the challenge is 
to go from many component models tested at small scales in the laboratory to 
the full-scale explosion of an aging nuclear weapon. 

The same issue arises in the evaluation of electronic circuits. In 2003, Allen
R. Hefner, Founder and Chairman of the NIST/IEEE Working Group on Model
Validation, wrote in its description: "The problem is that there is no systematic
way to determine the range of applicability of the models provided within circuit
simulator component libraries." See full-page boxed text for the complete version of
this interesting text, as well as Ref. [54]. This example of validation of electronic
circuits is particularly interesting because it stresses the origin of the difficulties
inherent in validation: the fact that the dynamics are nonlinear and complex with
threshold effects and do not allow for a simple-minded analytic approach consisting
in testing a circuit component by component. Extrapolating, this same difficulty is 
found in validating general circulation models of the Earth's climate or computer 
codes of nuclear explosions. The problem is thus fundamentally a "system" problem. 
The theory of systems, sometimes referred to as the theory of complex systems, is still 
in its infancy but has shown the existence of surprises. The biggest surprise may be 
the phenomenon of "emergence" in which qualitatively new processes or phenomena 
appear in the collective behavior of the system, while they cannot be derived or 
guessed from the behavior of each element. The phenomenon of "emergence" is 
similar to the philosophical law on the "transfer of the quantity into the quality." 
How does one validate a model of such a system? Validation therefore requires an 
understanding of this emergence phenomenon. 

From another angle, the problem is that of extrapolating a body of knowledge, 
which is firmly established only in some limited ranges of variables, parameters 
and scales, beyond this clear domain into a more fuzzy zone of unknowns. This 
problem has appeared and appears again and again in different guises in practically 
all scientific fields. A particularly notable domain of application is risk assessment; 
see, for instance, Kaplan and Garrick's classic paper on risks [55] , and the instructive 
history of quantitative risk analysis in US regulatory practice [56], especially in the 
US nuclear power industry [57][58][59][60]. An acute question in risk assessment is
how to quantify the potential for a catastrophic event (earthquake,
NIST Working Group on Model Validation
Allen R. Hefner, Ph.D.
Semiconductor Electronics Division
National Institute of Standards and Technology
Gaithersburg, MD 20899, USA

The primary objective of the working group is to establish well-defined procedures for 
the comprehensive evaluation of circuit simulator models. The working group has been 
established because there is presently no systematic way to determine the range of 
applicability of the models provided within circuit simulator component libraries. In 
addition, the accuracy of component models varies from one circuit simulator to another
due to the inclusion of different physical mechanisms in the generic models. The accuracy 
also varies due to the different methods used to determine the model parameters for specific 
device part numbers. 

The complex behavior of electronic devices prohibits individual users of circuit 
simulators from evaluating the proper inclusion of model physics and from determining the 
validity of approximations made to simplify simulator implementation. In addition, 
software vendors are often reluctant to provide a complete description of their model 
equations and model parameters to users. Therefore, the goal of the NIST Working Group 
on Model Validation is to establish experimental test procedures that can be used to
comprehensively evaluate circuit simulator models independently of the model equations, 
the simulator, or the model parameter extraction techniques. 

Traditionally, most circuit simulator model evaluation has been performed by comparing 
simulations with measurements for the easily measured steady-state output characteristics 
and capacitance-voltage characteristics. However, this type of evaluation is of little use in
determining the ability of the models to describe the more important dynamic 
characteristics. Presently, the only dynamic evaluation performed for circuit simulator 
models is for narrow ranges of conditions that tend to focus on those physical effects that 
are included in particular models and not on comprehensive evaluation of the model's
ability to describe the dynamic behavior of the device for the full range of application
conditions. 

The primary tasks of the working group are to determine the complete range of dynamic
conditions that must be described for each device type, and to then develop well-defined 
test procedures to evaluate the ability of the models to describe each type of dynamic 
condition. It is envisioned that each device type will require different test procedures, and 
that the circuit parameters for the test procedures will be determined for each part number 
based upon the device manufacturer's ratings and suggested application conditions.
Furthermore, the model validation test procedures should evolve to account for new device
variations, modeling requirements for new applications, and to prevent the development of
models that are designed only to best fit the standard test procedures. 

Typical model validation test procedures are expected to consist of test circuits that 
resemble application conditions, but that are simplified so that the test system is well 
characterizable and is able to isolate the important features of the device characteristics. 
Ultimately, standard characterization procedures will be established that specify the 
methods used to construct the test circuits to minimize the influence of parasitics, as well as 
the procedure for determining the circuit parameters based upon the device manufacturer's 
ratings. These standard test circuits could then be readily used to compare measured 
dynamic characteristics with those predicted by different circuit simulator component 
models. A database of the comparison results could also be maintained. 

For the working group to be successful, expertise in the following areas will be essential: 
electronic component design and manufacturing, model development, software 
development, component characterization, and circuit and system design. The success of 
this working group could result in an improved understanding of the validity and
limitations of existing circuit simulator component models. This could lead to the 
development of improved circuit simulator models and an increased confidence in the 
ability of the CAD tools to aid in the design of electronic systems. 
tornado, hurricane, flood, huge solar mass ejection, large meteorite, industrial plant
explosion, ecological disaster, financial crash, economic collapse, etc.) of amplitude 
never yet sampled from the knowledge of past history and present understanding. 

To tackle this enduring question, each discipline has developed its own strategies, 
often being unaware of the approaches of others. Here, we attempt a formulation 
of the problem, and outline some general directions of attack, that hopefully will 
transcend the specificities of each discipline. Our goal is to formulate the validation 
problem in a way that may encourage productive crossings of disciplinary lines 
between different fields by recognizing the commonalities of the blocking points, 
and suggest useful guidelines. 



2 Validation as a Constructive Iterative Process 

In a generic exercise in model validation, one performs an experiment and, in parallel, 
runs the calculations with the available model. A comparison between the measure- 
ments of the experiment and the outputs of the model calculations is then performed. 
This comparison uses some metrics controlled by experimental feasibility, i.e., what 
can actually be measured. One then iterates by refining the model until (admittedly 
subjective) satisfactory agreement is obtained. Then, another set of experiments is 
performed, which is compared with the corresponding predictions of the model. If 
the agreement is still satisfactory without modifying the model, this is considered 
progress in the validation of the model. Iterating with experiments testing different 
features of the model corresponds to mimicking the process of construction of a 
theory in physics [61]. As the model is exposed to increasing scrutiny and testing,
the testers develop a better understanding of the reliability (and limitations) of the 
model in predicting the outcome of new experimental and/or observational set-ups. 
This implies that "validation activity should be organized like a project, with goals 
and requirements, a plan, resources, a schedule, and a documented record" [^. 

Extending previous work [29][30][31][32], we thus propose to formulate the
validation problem of a given model as an iterative construction that embodies the 
often implicit process occurring in the minds of scientists: 

1. One starts with an a priori trust quantified by the value $V_{\mathrm{prior}}$ in the potential
value of the model. This quantity captures the accumulated evidence thus far.
If the model is new or the validation process is just starting, take $V_{\mathrm{prior}} = 1$.
As we will soon see, the absolute value of $V_{\mathrm{prior}}$ is unimportant but its relative
change is important.

2. An experiment is performed, the model is set up to calculate what should be
the outcome of the experiment, and the comparison between these predictions 
and the actual measurements is made either in model space or in observation 
space. The comparison requires a choice of metrics. 

3. Ideally, the quality of the comparison between predictions and observations is 
formulated as a statistical test of significance in which a hypothesis (the model)
is tested against the alternative, which is "all the rest." Then, the formulation of 
the comparison will be either "the model is rejected" (it is not compatible with 
the data) or "the model is compatible with the data." In order to implement this 
statistical test, one needs to attribute a likelihood $p(M|y_{\mathrm{obs}})$ or, more generally,
a metric-based "grade" that quantifies the quality of the comparison between
the predictions of the model $M$ and observations $y_{\mathrm{obs}}$. This grade is compared
with the reference likelihood q of "all the rest." Examples of implementations 
include the sign test and the tolerance interval methods. In many cases, one 
does not have the luxury of a likelihood; one has then to resort to more empirical 
assessments of how well the model explains crucial observations. In the most 
complex cases, the outcome can be binary (accepted or rejected). 
4. The posterior value of the model is obtained according to a formula of the type 

$$ V_{\mathrm{posterior}} / V_{\mathrm{prior}} = F\left[ p(M|y_{\mathrm{obs}}), q; c_{\mathrm{novel}} \right] . \qquad (1) $$

In this expression, $V_{\mathrm{posterior}}$ is the posterior potential, or coefficient, of trust in
the value of the model after the comparison between the prediction of the model
and the new observations has been performed. By the action of $F[\cdots]$, $V_{\mathrm{posterior}}$
can be either larger or smaller than $V_{\mathrm{prior}}$: in the former case, the experimental
test has increased our trust in the validity of the model; in the latter case, the
experimental test has signaled problems with the model. One could call $V_{\mathrm{prior}}$
and $V_{\mathrm{posterior}}$ the evolving "potential value of our trust" in the model or, loosely
paraphrasing the theory of decision making in economics, the "utility" of the
model [63].

The transformation from the potential value $V_{\mathrm{prior}}$ of the model before the experimental
test to $V_{\mathrm{posterior}}$ after the test is embodied into the multiplier $F$, which can
be either larger than 1 (towards validation) or smaller than 1 (towards invalidation).
We postulate that $F$ depends on the grade $p(M|y_{\mathrm{obs}})$, to be interpreted as
proportional to the probability of the model $M$ given the data $y_{\mathrm{obs}}$. It is natural
to compare this probability with the reference likelihood $q$ that one or more of all
other conceivable models is compatible with the same data.
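As one illustration of how a grade $p$ might be obtained in step 3, the sketch below implements a two-sided sign test on the residuals between model predictions and observations. Treating the resulting p-value as the grade $p$, and comparing it with a pre-defined level $q$, is an assumption of this sketch; the function name `sign_test_p_value` and the sample data are hypothetical.

```python
from math import comb


def sign_test_p_value(predictions, observations):
    """Two-sided sign test on residuals (observation minus prediction).

    Under the null hypothesis that the model is unbiased, positive and
    negative residuals are equally likely, so the number of positive signs
    follows a Binomial(n, 1/2) distribution. Exact ties are discarded.
    """
    diffs = [o - m for m, o in zip(predictions, observations) if o != m]
    n = len(diffs)
    if n == 0:
        return 1.0
    k = sum(1 for d in diffs if d > 0)            # number of positive residuals
    tail = min(k, n - k)                          # size of the smaller group
    p = 2.0 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


if __name__ == "__main__":
    model = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    obs = [1.1, 1.9, 3.2, 4.1, 4.8, 6.3]
    p = sign_test_p_value(model, obs)
    q = 0.05                                      # assumed reference level
    print(f"p = {p:.4f}, p/q = {p / q:.2f}")
```

A model whose residuals all share one sign receives a small $p$, which pushes the ratio $p/q$ below 1 and hence the multiplier $F$ towards invalidation.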

Our multiplier $F$ depends also on a parameter $c_{\mathrm{novel}}$ that quantifies the importance
of the test. In other words, $c_{\mathrm{novel}}$ is a measure of the impact of the experiment
or of the observation, that is, how well the new observation explores novel "dimensions"
of the parameter and variable spaces of both the process and the model that
can reveal potential flaws. A fundamental challenge is that the determination of
$c_{\mathrm{novel}}$ requires, in some sense, a pre-existing understanding of the physical processes
so that the value of a new experiment can be fully appreciated. In concrete situations,
one has only a limited understanding of the physical processes and the value
of a new observation is only assessed after a long learning phase, after comparison
with other observations and experiments, as well as after comparison with the model,
making $c_{\mathrm{novel}}$ possibly self-referencing. Thus, we consider $c_{\mathrm{novel}}$ as a judgment-based
weighting of experimental referents, in which judgment (for example, by a subject
matter expert) is dominant in its determination. The fundamental problem is

Pal and Makai [62] have used the mathematical statistics of hypothesis testing 
as a way to validate the correctness of code simulating the operation of a com- 
plex system with respect to a level of confidence for safety problems. The main 
conclusion is that the testing of the input variables separately may lead to incor- 
rect safety related decisions with unforeseen consequences. They have used two 
statistical methods: the sign test and the tolerance interval methods for testing 
more than one mutually dependent output variables. We propose to use these and 
similar tests delivering a probability level p which can then be compared with a 
pre-defined likelihood level q. 
to quantify the relevance of a new experimental referent for validation to a given
decision-making problem, given that the experimental domain of the test does not
overlap with the application domain of the decision. Assignment of $c_{\mathrm{novel}}$ requires
the judgment of subject matter experts, whose opinions will likely vary. This variability
must be acknowledged (if not accounted for, however naively) in assigning
$c_{\mathrm{novel}}$. Thus, providing an a priori value for $c_{\mathrm{novel}}$, as required in expression (1),
remains a difficult and key step in the validation process. This difficulty is similar
to specifying the utility function in decision making [63].

Repeating an experiment twice is a special degenerate case since it amounts ideally
to increasing the size of the statistical sample. In such a situation, one should
aggregate the two experiments 1 and 2 (yielding the relative likelihoods $p_1/q$ and
$p_2/q$, respectively) graded with the same $c_{\mathrm{novel}}$ into an effective single test with the
same $c_{\mathrm{novel}}$ and likelihood $(p_1/q)(p_2/q)$. This is the ideal situation; there are, however,
cases where repeating an experiment may wildly increase the evidence of systemic
uncertainty or demonstrate uncontrolled variability or other kinds of problems. When
this occurs, the assumption that there is no surprise, no novelty, in
repeating the experiment is incorrect. Then, the two experiments should be treated
so as to contribute two multipliers $F$, because they reveal different kinds of uncertainty
that can be generated by ensembles of experiments.

One experimental test corresponds to an entire loop 1-4 transforming a given
$V_{\mathrm{prior}}$ to a $V_{\mathrm{posterior}}$ according to (1). This $V_{\mathrm{posterior}}$ becomes the new $V_{\mathrm{prior}}$ for the
next test, which will transform it into another $V_{\mathrm{posterior}}$ and so on, according to the
following iteration process:

$$ V^{(1)}_{\mathrm{prior}} \to V^{(1)}_{\mathrm{posterior}} = V^{(2)}_{\mathrm{prior}} \to V^{(2)}_{\mathrm{posterior}} = V^{(3)}_{\mathrm{prior}} \to \cdots \to V^{(n)}_{\mathrm{posterior}} . \qquad (2) $$

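After $n$ validation loops, we have a posterior trust in the model given by

$$ \frac{V^{(n)}_{\mathrm{posterior}}}{V^{(1)}_{\mathrm{prior}}} = F\left[ p^{(1)}(M|y^{(1)}_{\mathrm{obs}}), q^{(1)}; c^{(1)}_{\mathrm{novel}} \right] \cdots F\left[ p^{(n)}(M|y^{(n)}_{\mathrm{obs}}), q^{(n)}; c^{(n)}_{\mathrm{novel}} \right] , \qquad (3) $$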



where the product is time-ordered since the sequence of values for $c^{(j)}_{\mathrm{novel}}$ depends
on preceding tests. Validation can be said to be asymptotically satisfied when the
number of steps $n$ and the final value $V^{(n)}_{\mathrm{posterior}}$ are sufficiently high. How high is high
enough is subjective and may depend on both the application and programmatic 
constraints. The concrete examples discussed below offer some insight on this issue. 
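The iteration (2) and the time-ordered product (3) can be sketched as a simple loop. The multiplier is left pluggable; the default used here, $F = (p/q)^{c_{\mathrm{novel}}}$, is the simple power-law form discussed in Section 3 and is only one admissible choice. The names `Experiment` and `run_validation` and the example grades are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    p: float        # grade of the model against the observations
    q: float        # reference likelihood of "all the rest"
    c_novel: float  # judgment-based novelty of the test


def power_law_multiplier(p, q, c_novel):
    """Assumed simple multiplier F = (p/q)**c_novel."""
    return (p / q) ** c_novel


def run_validation(experiments, multiplier=power_law_multiplier, v_prior=1.0):
    """Apply one multiplier per test, in order, and record the trust trajectory.

    The posterior of each loop becomes the prior of the next one.
    """
    v = v_prior
    trajectory = [v]
    for e in experiments:
        v *= multiplier(e.p, e.q, e.c_novel)
        trajectory.append(v)
    return trajectory


if __name__ == "__main__":
    tests = [
        Experiment(p=0.20, q=0.10, c_novel=1.0),  # passed, ordinary novelty
        Experiment(p=0.05, q=0.10, c_novel=2.0),  # failed, high novelty
    ]
    for step, v in enumerate(run_validation(tests)):
        print(step, round(v, 4))
```

Because only ratios enter, rescaling `v_prior` rescales the whole trajectory, which is why the absolute value of $V_{\mathrm{prior}}$ is unimportant.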



This sequence is reminiscent of a branching process: most of the time, after
the first or second validation loop, the model will be rejected if $V^{(n)}_{\mathrm{posterior}}$ becomes
much smaller than $V^{(1)}_{\mathrm{prior}}$. The occurrence of a long series of validation
tests is specific to those rare models/codes that happen to survive. We conjecture
that the nature of models and their tests makes the probability of survival up to
level $n$ a power law decaying as a function of validation generation number $n$:
$\Pr[V^{(n)}_{\mathrm{posterior}} \geq V^{(1)}_{\mathrm{prior}}] \sim 1/n^{\tau}$, for large $n$. The exponent is $\tau = 3/2$ in mean-field
branching processes [64]; being an ensemble average over random test outcomes,
we expect this to be only an upper bound for actual validation processes. The four
illustrative examples provided further on, augmented with a fifth one described in
Ref. [1], yield $\tau \approx 0.85$ for $3 < n < 7$ with just one outlier. Although the sample
of models is tiny, this illustrates our point.
This construction makes clear that there is no absolute validation, only a process of
corroborating or disproving steps competing in a global valuation of the model under
scrutiny. The product (3) expresses the assumption that successive observations
give independent multipliers. This assumption keeps the procedure simple because
determining the dependence between different tests with respect to validation would
be highly underdetermined. We propose that it is more convenient to measure the
dependence through the single parameter $c^{(j)}_{\mathrm{novel}}$ quantifying the novelty of the $j$th
test with respect to those preceding it. In full generality, each new $F$ multiplier
should be a function of all previous tests.

The loop 1-4 together with expression (1) are offered as an attempt to quantify
the progression of the validation process. Eventually, when one has performed several
approximately independent tests exploring different features of the model and of
the validation process, $V_{\mathrm{posterior}}$ has grown to a level at which most experts will be
satisfied and will believe in the validity of (i.e., be inclined to trust) the model. This
formulation has the advantage of viewing the validation process as a convergence
or divergence built on a succession of steps, mimicking the construction of a theory
of reality. Expression (3) embodies the progressive build-up of trust in a model
or theory. This formulation provides a formal setting for discussing the difficulties
that underlie the so-called impossibilities [19][21] in validating a given model. Here,
these difficulties are not only partitioned but quantified:

• in the definition of "new" non-redundant experiments (parameter $c_{\mathrm{novel}}$),

• in choosing the metrics and the corresponding statistical tests quantifying the 
comparison between the model and the measurements of this experiment (leading 
to the likelihood ratio p/q), and 

• in iterating the procedure so that the product of the gain/loss factors $F[\cdots]$
obtained after each test eventually leads to a clear-cut conclusion after several 
tests. 

This formulation makes clear why and how one is never fully convinced that vali- 
dation has been obtained: it is a matter of degree, of confidence level, of decision 
making, as in statistical testing. But this formulation helps in quantifying what new 
confidence (or distrust) is gained in a given model. It emphasizes that validation is 
an ongoing process, similar to the never-ending construction of a theory of reality. 

The general formulation proposed here in terms of iterated validation loops is 
intimately linked with decision theory based on limited knowledge: the decision to 
"go ahead" and use the model is fundamentally a decision problem based on the 
accumulated confidence embodied in $V_{\mathrm{posterior}}$. The "go/no-go" decision must take
into account conflicting requirements and compromise between different objectives. 
Decision theory was created by the statistician Abraham Wald in the late forties [65] , 
but is based ultimately on game theory [63][66]. Wald used the term loss function,
which is the standard terminology used in mathematical statistics. In mathemati- 
cal economics, the opposite of the loss (or cost) function gives the concept of the 
utility function, which quantifies (in a specific functional form) what is considered 
important and robust in the fit of the model to the data. We use Vposterior in an 

^ It is conceivable that a new and radically different observation/experiment may 
arise and challenge the built-up trust in a model; such a scenario exemplifies how 
any notion of validation "convergence" is inherently local. 
even more general sense than "utility," as a decision and information-based valuation
that supports risk-informed decision-making based on "satisficing" (see the
concrete examples discussed below).

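It may be tempting to interpret the above formulation of the validation problem
in terms of Bayes' theorem

$$ p_{\mathrm{posterior}}(M|\mathrm{Data}) = \frac{p_{\mathrm{prior}}(M) \, \Pr(\mathrm{Data}|M)}{\Pr(\mathrm{Data})} , \qquad (4) $$

where $\Pr(\mathrm{Data}|M)$ is the likelihood of the data given the model $M$, and $\Pr(\mathrm{Data})$ is
the unconditional likelihood of the data. However, we cannot make immediate sense
of $\Pr(\mathrm{Data})$. Only when a second model $M'$ is introduced can we actually calculate

$$ \Pr(\mathrm{Data}) = p_{\mathrm{prior}}(M) \Pr(\mathrm{Data}|M) + p_{\mathrm{prior}}(M') \Pr(\mathrm{Data}|M') . \qquad (5) $$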

In other words, Bayes' formulation requires that we set a model/hypothesis in oppo- 
sition to another or other ones, while we examine here the case of a single hypothesis 
in isolation. 
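A small numerical sketch makes the point concrete: the Bayesian posterior of $M$ cannot be computed without committing to an alternative $M'$, and it changes when the alternative changes. The sketch assumes that $M$ and $M'$ exhaust the possibilities, so that their priors sum to one; the function name `posterior_two_models` and the numbers are hypothetical.

```python
def posterior_two_models(prior_m, like_m, like_m_alt):
    """Posterior probability of model M when a single alternative M' is specified.

    Assumes the two priors sum to one, so that Pr(Data) follows from (5).
    """
    prior_alt = 1.0 - prior_m
    evidence = prior_m * like_m + prior_alt * like_m_alt   # Pr(Data), as in (5)
    return prior_m * like_m / evidence                     # Bayes' theorem


if __name__ == "__main__":
    # Same model M, same data, same likelihood Pr(Data|M) = 0.8 ...
    weak_rival = posterior_two_models(0.5, 0.8, like_m_alt=0.2)
    strong_rival = posterior_two_models(0.5, 0.8, like_m_alt=0.8)
    # ... yet the posterior depends entirely on which rival M' was chosen.
    print(weak_rival, strong_rival)
```

With no rival specified at all, `evidence` is undefined, which is the situation of a single hypothesis examined in isolation.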

We therefore stress that one should resist the urge to equate our $V_{\mathrm{prior}}$ and
$V_{\mathrm{posterior}}$ with $p_{\mathrm{prior}}$ and $p_{\mathrm{posterior}}$ because the former are not probabilities. It is not possible
to assign a probability to an experiment in an absolute way and thus Bayes' theorem
is mute on the validation problem as we have chosen to formulate it. Rather, we
propose that the problem of validation is fundamentally a problem of decision theory:
at what stage is one willing to bet that the code will work for its intended use? At
what stage are you ready to risk your reputation, your job, the lives of others, your
own life on the fact that the model/code will predict correctly the crucial aspect of
the real-life test? One must therefore incorporate ingredients of decision theory, and
not only fully objective probabilities. Coming from a Bayesian perspective, $p_{\mathrm{prior}}$
and $p_{\mathrm{posterior}}$ could then be called the potential value or trust in the model/code or,
as we prefer, to move closer to the application of decision theory in economics, the
utility of the model/code [63].

To summarize the discussion so far, the above expression for $V_{\mathrm{posterior}}/V_{\mathrm{prior}}$ may be
reminiscent of a Bayesian analysis; however, it does not manipulate probabilities.
(Instead, they appear as independent variables, viz., $p(M|y_{\mathrm{obs}})$ and $q$.) In the Bayesian methodology
of validation [69][70], only comparison between models can be performed due to the
need to remove the unknown probability of the data in Bayes' formula. In contrast,
our approach provides a value for each single model independently of the others.
In addition, it emphasizes the importance of quantifying the novelty of each test
and takes a more general view on how to use the information provided from the
goodness-of-fit. The valuation (1) of a model uses probabilities as partial inputs, not
as the qualifying criteria for model validation. This does not mean, however, that
there are no uncertainties in these quantities or in the terms $F$, $q$, or $c_{\mathrm{novel}}$, or that
aleatory and epistemic uncertainties are ignored, as discussed below.

In economics, satisficing is a behavior that attempts to achieve at least some 
minimum level of a particular variable, but that does not strive to achieve its 
maximum possible value. The verb "to satisfice" was coined by Herbert A. Simon 
in his theory of bounded rationality [67][68].

For an in-depth discussion on aleatory versus systemic (a.k.a. epistemic) uncer- 
tainties, see for example Review of Recommendations for Probabilistic Seismic 
3 Desirable Properties of the Multiplier of the
Validation Step

The multiplier $F[p(M|y_{\mathrm{obs}}), q; c_{\mathrm{novel}}]$ should have the following properties:

1. If the statistical test(s) performed on the given observations is (are) passed at
the reference level $q$, then the posterior potential value is larger than the prior
potential value: $F > 1$ (resp. $F < 1$) for $p > q$ (resp. $p < q$), which can be
written succinctly as $\ln F / \ln(p/q) > 0$.

2. The larger the statistical significance of the passed test, the larger the posterior
value. Hence

$$ \frac{\partial F}{\partial p} > 0 $$

for a given $q$. There could be a saturation of the growth of $F$ for large $p/q$,
which can be either that $F < \infty$ as $p/q \to \infty$ or of the form of a concavity
requirement

$$ \frac{\partial^2 F}{\partial p^2} < 0 $$

for large $p/q$: obtaining a quality of fit beyond a certain level should not be
attempted.

3. The larger the statistical level at which the test(s) performed on the given 
observations is (are) passed, the larger the impact of a "novel" experiment on 
the multiplier enhancing the prior into the posterior potential value of the model: 
$\partial F / \partial c_{\mathrm{novel}} > 0$ (resp. $< 0$), for $p > q$ (resp. $p < q$).

A very simple multiplier that obeys these properties (not including the
saturation of the growth of $F$) is given by

$$ F\left[ p(M|y_{\mathrm{obs}}), q; c_{\mathrm{novel}} \right] = \left( \frac{p}{q} \right)^{c_{\mathrm{novel}}} , \qquad (6) $$

and is illustrated in the upper panel of Fig. 2 as a function of $p/q$ and $c_{\mathrm{novel}}$. This
form provides an intuitive interpretation of the meaning of the experiment impact
parameter $c_{\mathrm{novel}}$. A non-committal evaluation of the novelty of a test would be
$c_{\mathrm{novel}} = 1$, thus $F = p/q$ and the chain (3) reduces to a product of normalized
likelihoods, as in standard statistical tests. A value $c_{\mathrm{novel}} > 1$ (resp. $< 1$) for a
given experiment describes a nonlinearly rapid (resp. slow) updating of our trust
$V$ as a function of the grade $p/q$ of the model with respect to the observations. In
particular, a large value of $c_{\mathrm{novel}}$ corresponds to the case of "critical" tests. Note
that the parameterization of $c_{\mathrm{novel}}$ in (6) should account for the decreased novelty
noted above occurring when the same experiment is repeated two or more times.
The value of $c_{\mathrm{novel}}$ should be reduced for each repetition of the same test; moreover,
the value of $c_{\mathrm{novel}}$ should approach unity as the number of repetitions increases.
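The three properties listed above can be checked numerically for any candidate multiplier. The sketch below does so by finite differences on a small grid, taking the power-law form $F = (p/q)^{c_{\mathrm{novel}}}$ as the candidate; the function names, the grid values, and the step size are hypothetical choices made for illustration only.

```python
import math


def multiplier_power(p, q, c_novel):
    """Candidate multiplier F = (p/q)**c_novel."""
    return (p / q) ** c_novel


def check_properties(F, q=0.1, h=1e-6):
    """Check properties 1 to 3 on a small grid of p/q ratios and novelties.

    Returns True only if, at every grid point,
      1. ln F / ln(p/q) > 0,
      2. F increases with p at fixed q,
      3. dF/dc_novel has the sign of (p - q).
    """
    ratios = [0.2, 0.5, 0.9, 1.1, 2.0, 5.0]       # p/q values on both sides of 1
    novelties = [0.5, 1.0, 2.0]
    for r in ratios:
        p = r * q
        for c in novelties:
            f = F(p, q, c)
            if math.log(f) / math.log(r) <= 0:    # property 1
                return False
            if F(p + h * q, q, c) <= f:           # property 2 (forward difference)
                return False
            d_f_dc = F(p, q, c + h) - f           # property 3 (forward difference)
            if (d_f_dc > 0) != (r > 1):
                return False
    return True


if __name__ == "__main__":
    print(check_properties(multiplier_power))
    # A deliberately wrong candidate that rewards poor fits fails the check.
    print(check_properties(lambda p, q, c: (q / p) ** c))
```

The same check can be applied to other candidate multipliers before they are used in the validation loop; it does not test the optional saturation requirement of property 2.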

Hazard Analysis: Guidance on Uncertainty and Use of Experts [71], available at
http://www.nap.edu/catalog/5487.html

A momentous example is the Michelson-Morley experiment for the Theory of 
Special Relativity. For the Theory of General Relativity, it was the observation 
during the famous 1919 solar eclipse of the bending of light rays from distant 
stars by the Sun's mass and the elegant explanation of the anomalous precession 
of the perihelion of Mercury's orbit. 




Fig. 2. The multipliers defined by (6) and (7) are plotted as functions of $p/q$ and
$c_{\rm novel}$ in the upper and lower panels, respectively. Note the vertical log scale used for
the multiplier (6) in the top panel.

An alternative multiplier,

$$F[p(M|y_{\rm obs}), q; c_{\rm novel}] = \left[\frac{\tanh(p/q + 1/c_{\rm novel})}{\tanh(1 + 1/c_{\rm novel})}\right]^4, \qquad (7)$$

is plotted in the lower panel of Fig. 2 as a function of $p/q$ and $c_{\rm novel}$. It emphasizes
that F saturates as a function of $p/q$ and $c_{\rm novel}$ as either one or both of them grow
large. A completely new experiment corresponds to $c_{\rm novel} \to \infty$, so that $1/c_{\rm novel} = 0$
and thus F tends to $[\tanh(p/q)/\tanh(1)]^4$, i.e., $V_{\rm posterior}/V_{\rm prior}$ is only determined by
the quality of the "fit" of the data by the model quantified by $p/q$. A finite $c_{\rm novel}$ thus
implies that one already takes a restrained view on the usefulness of the experiment
since one limits the amplitude of the gain $V_{\rm posterior}/V_{\rm prior}$, whatever the quality of
the fit of the data by the model. The exponent 4 in (7) has been chosen so that the
maximum confidence gain F is equal to $\tanh(1)^{-4} \approx 3$ in the best possible situation
of a completely new experiment ($c_{\rm novel} = \infty$) and perfect fit ($p/q \to \infty$). In contrast,
the multiplier F can be arbitrarily small as $p/q \to 0$ even if the novelty of the test is
high ($c_{\rm novel} \to \infty$). For a finite novelty $c_{\rm novel}$, a test that fails the model miserably
($p/q \approx 0$) does not necessarily reject the model completely: unlike the expression in
(6), F remains greater than zero. Indeed, if the novelty $c_{\rm novel}$ is small, the worst-case
multiplier (attained for $p/q = 0$) is
$[\tanh(1/c_{\rm novel})/\tanh(1 + 1/c_{\rm novel})]^4 \approx 1 - 6.9\, e^{-2/c_{\rm novel}}$,
which is only slightly less than unity if $c_{\rm novel} \ll 1$. In short,
this formulation does not heavily weight unimportant tests, as seems intuitively
appropriate.
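
The limiting behaviors quoted above can be checked numerically. The following is a minimal sketch, assuming only the two closed forms (6) and (7); the function names `multiplier_power` and `multiplier_tanh` are illustrative choices, not part of the original formulation.

```python
import math

def multiplier_power(p_over_q, c_novel):
    """Multiplier (6): F = (p/q)**c_novel."""
    return p_over_q ** c_novel

def multiplier_tanh(p_over_q, c_novel):
    """Multiplier (7): F = [tanh(p/q + 1/c_novel) / tanh(1 + 1/c_novel)]**4."""
    inv_c = 1.0 / c_novel  # evaluates to 0.0 when c_novel is math.inf
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4

# A test passed exactly at the reference level (p = q) leaves the trust unchanged.
neutral_power = multiplier_power(1.0, 10)
neutral_tanh = multiplier_tanh(1.0, 10)

# Best case for (7): perfect fit of a completely new experiment.
max_gain = multiplier_tanh(math.inf, math.inf)

# Worst case for (7) with a low-novelty test, next to the asymptotic estimate.
c_small = 0.2
worst_exact = multiplier_tanh(0.0, c_small)
worst_approx = 1.0 - 6.9 * math.exp(-2.0 / c_small)

# A total failure (p/q = 0) rejects the model outright under (6) but not under (7).
reject_power = multiplier_power(0.0, 100)
reject_tanh = multiplier_tanh(0.0, 100)

print(f"neutral test:      (6) {neutral_power:.3f}   (7) {neutral_tanh:.3f}")
print(f"maximum gain (7):  {max_gain:.3f}")
print(f"worst case (7):    exact {worst_exact:.6f}   estimate {worst_approx:.6f}")
print(f"total failure:     (6) {reject_power:.1e}   (7) {reject_tanh:.1e}")
```

The value `c_small = 0.2` is an arbitrary illustration of the regime $c_{\rm novel} \ll 1$; any other small value could be used.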

In the framework of decision theory, the update rule for the potential value V, with
one of the specific expressions (6) or (7) for the multiplier, provides a parametric form
for the utility or decision "function" of the decision maker. Many other forms of the
utility function can clearly be used, provided that they keep the salient features of
these expressions in terms of the impact of a new test given past tests, and
the quality of the comparison between the model predictions and the data. This
indetermination is helpful since it mirrors the inherent variability of the validation
landscape. For instance, what comprises adequate validation for phenomena at one
(e.g., macro-)scale may prove inadequate for related phenomena at another (e.g.,
micro-)scale.

Finally, we remark that the proposed form for the multiplier (7) contains an
important asymmetry between gains and losses: the failure of a single test with
strong novelty and significance cannot be compensated by the success of all the
other tests combined. In other words, a single test is enough to reject a model.
This encapsulates the common lore that reputation gain is a slow process requiring 
constancy and tenacity, while its loss can occur suddenly with one single failure and 
is difficult to re-establish. We believe that the same applies to the build-up of trust 
in and, thus, validation of a model. 

See, e.g., the impact of localized seismicity on faults in the case of the
Olami-Feder-Christensen model discussed below, or that of the "leverage" effect in
quantitative finance for the Multifractal Random Walk model described and evaluated
in Ref. [1].

4 Practical Guidelines for Determining $p/q$ and $c_{\rm novel}$

These two crucial elements of a validation step are conditioned by four basic
problems, over which one can exert at least partial control. In particular, they address the
two sources of uncertainty: "reducible" or epistemic (i.e., due to lack of knowledge)
and "irreducible" or aleatory (i.e., due to variability inherent in the phenomenon
under consideration). In a nutshell, as becomes clearer below, the comparison
between p and q is more concerned with the aleatory uncertainty while $c_{\rm novel}$ deals
in part with the epistemic uncertainty. In the following, as in the two examples (6)
and (7), we consider that p and q enter only in the form of their ratio $p/q$. This
need not be the case in general but, given the many uncertainties, this restriction
simplifies the analysis by removing one degree of freedom.

1. How to model? This addresses model construction and involves the structure of 
the elementary contributions, their hierarchical organization, and requires deal- 
ing with uncertainties and fuzziness. This concerns the epistemic uncertainty. 

2. What to measure? This relates to the nature of $c_{\rm novel}$: ideally, one should target
the observations adaptively to "sensitive" parts of the system and the model (as,
e.g., Palmer et al. [72] did for atmospheric dynamics). Targeting observations
could be directed by the desire to access the most "relevant" information as
well as to get information that is the most reliable, i.e., which is contaminated
by the smallest errors. This is also the stance of Oberkampf and Trucano [33]:
"A validation experiment is conducted for the primary purpose of determining 
the validity, or predictive accuracy, of a computational modeling and simulation 
capability. In other words, a validation experiment is designed, executed, and an- 
alyzed for the purpose of quantitatively determining the ability of a mathematical 
model and its embodiment in a computer code to simulate a well-characterized 
physical process." In practice, we view $c_{\rm novel}$ as an estimate of the importance
of the new observation and the degree of "surprise" it brings to the validation 
step. Being the cornerstone of our formal approach to validation, we eventually 
want to see its determination grounded in sensitivity and/or PIRT analysis (cf. 
section [LS]). The epistemic uncertainty alluded to above is partially addressed 
in the choice of the empirical data and its rating with $c_{\rm novel}$ (see the examples
of application discussed below).

3. How to measure? For given measurements or experiments, the problem is to find
the "optimal" metric or cost function (involved in the quality-of-fit measure p) 
for the intended use of the model. The notion of optimality needs to be defined. 
It could capture a compromise between fitting best the important features of the 
data (what is "important" may be decided on the basis of previous studies and 
understanding or other processes, or programmatic concerns), and minimizing 
the extraction of spurious information from noise. This requires one to have a 
precise idea of the statistical properties of the noise. If such knowledge is not 
available, the cost function should be chosen accordingly. The choice of the cost 
function involves the choice of how to look at the data. For instance, one may 
want to expand the measurements at multiple scales using wavelet decomposi- 
tions and compare the prediction and observations scale by scale, or in terms of 
multifractal spectra of the physical fields estimated from these wavelet decompo- 
sitions [73] or from other methods. The general idea here is that, given complex 
observation fields, it is appropriate to unfold the data on a variety of "metrics," 
which can then be used in the comparison between observations and model
predictions. The question is then: How well is the model able to reproduce the
salient multi-scale and multifractal properties derived from the observations? 
The physics of turbulent fields and of complex systems have offered many such 
new tools with which to unfold complex fields according to different statistics. 
Each of these statistics offers a metric to compare observations with model pre- 
dictions and is associated with a cost function focusing on a particular feature 
of the process. Since these metrics derive from the understanding that turbulent
fields obey strong constraints in their organization, which the metrics reveal,
they can justifiably be called "physics-based." In
practice, p, and eventually $p/q$, has to be inferred as an estimate of the degree
of matching between the model output and the observation. This can be done
following the concept of fuzzy logic in which one replaces the yes/no pass test
by a more gradual quantification of matching [74, 75]. We thus concur with
Oberkampf and Barone [76], although our general methodology goes further. Note
that this discussion relates primarily to the aleatory uncertainty.

4. How to interpret the results? This question relates to defining the test and 
the reference probability level q that any other model (than the one under 
scrutiny) can explain the data. The interpretation of the results should aim at 
detecting the "dimensions" that are missing, misrepresented or erroneous in the 
model (systemic/epistemic uncertainty). What tests can be used to betray the 
existence of hidden degrees of freedom and/or dimensions? This is the hardest 
problem. It can sometimes possess an elegant solution when a given model is 
embedded in a more general one. Then, the limitation of the more restricted 
model becomes clear from the vantage of the more general model. 

We refer to the Appendix for further thoughts on these four basic steps in model 
construction and validation in a broader context than our present formulation. 
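
The scale-by-scale comparison mentioned under "How to measure?" can be sketched as follows. This is only an illustration under stated assumptions: the Haar wavelet is chosen for simplicity, the two signals are synthetic, and the helper names `haar_detail_energy` and `log_energy_ratio` are hypothetical. How such a per-scale mismatch should be converted into a grade $p/q$ is not specified here and is left open.

```python
import math

def haar_detail_energy(signal):
    """Mean squared Haar detail coefficient per dyadic scale (finest scale first)."""
    n = len(signal)
    if n < 2 or n & (n - 1):
        raise ValueError("signal length must be a power of two, at least 2")
    approx = list(signal)
    energies = []
    while len(approx) > 1:
        evens, odds = approx[0::2], approx[1::2]
        details = [(a - b) / math.sqrt(2.0) for a, b in zip(evens, odds)]
        approx = [(a + b) / math.sqrt(2.0) for a, b in zip(evens, odds)]
        energies.append(sum(d * d for d in details) / len(details))
    return energies

def log_energy_ratio(observed, modeled, floor=1e-30):
    """log10 of (model energy / observed energy) at each scale; 0 means a match."""
    obs = haar_detail_energy(observed)
    mod = haar_detail_energy(modeled)
    return [math.log10((m + floor) / (o + floor)) for o, m in zip(obs, mod)]

# Synthetic illustration: the "model" captures the large-scale wave of the
# "observation" but misses its small-scale oscillation.
n = 64
observed = [math.sin(2 * math.pi * i / n) + 0.3 * math.sin(2 * math.pi * i / 4)
            for i in range(n)]
modeled = [math.sin(2 * math.pi * i / n) for i in range(n)]

mismatch = log_energy_ratio(observed, modeled)
for level, value in enumerate(mismatch):
    print(f"scale of 2^{level + 1} samples: log10(model/obs energy) = {value:+.2f}")
```

In this toy case the mismatch is confined to the two finest scales, which is the kind of localized diagnostic that a single global misfit number would hide.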

We now illustrate our algorithmic approach to model validation using the
historical development of quantum mechanics and three examples based on the authors'
research activities. In these crude but revealing examples, we will use the form (7)
and consider three finite values: $c_{\rm novel} = 1$ (marginally useful new test), $c_{\rm novel} = 10$
(substantially new test), and $c_{\rm novel} = 100$ (important new test). When a likelihood
test is not available, we propose to use three possible marks: $p/q = 0.1$ (poor fit),
$p/q = 1$ (marginally good fit), and $p/q = 10$ (good fit). Extreme values ($c_{\rm novel}$ or
$p/q$ equal to 0 or $\infty$) have already been discussed. Due to limited experience with this
approach, we propose these ad hoc values in the following examples of its application.
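
For reference, the nine multipliers implied by these ad hoc marks can be tabulated directly from (7). The sketch below assumes nothing beyond that formula; the dictionary keys simply repeat the verbal labels used above, and `multiplier_tanh` is an illustrative name.

```python
import math

def multiplier_tanh(p_over_q, c_novel):
    """Multiplier (7) for a single validation step."""
    inv_c = 1.0 / c_novel
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4

# Ad hoc marks proposed in the text.
novelty_marks = {"marginally useful": 1, "substantially new": 10, "important": 100}
fit_marks = {"poor": 0.1, "marginally good": 1.0, "good": 10.0}

table = {
    (novelty, fit): multiplier_tanh(p_over_q, c_novel)
    for novelty, c_novel in novelty_marks.items()
    for fit, p_over_q in fit_marks.items()
}

print(f"{'novelty':<18}" + "".join(f"{fit:>16}" for fit in fit_marks))
for novelty in novelty_marks:
    row = "".join(f"{table[(novelty, fit)]:>16.2g}" for fit in fit_marks)
    print(f"{novelty:<18}{row}")
```

A marginally good fit is neutral (F = 1) at every novelty level, while the penalty for a poor fit grows steeply with the novelty of the test.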

5 Illustration with the Development of Quantum 
Mechanics 

Quantum mechanics (QM) offers a vivid incarnation of how a model can turn
progressively into a theory held "true" by almost all physicists. Since its birth, QM has
been tested again and again because it presents a view of "reality" that is shockingly 
different from the classical view experienced at the macroscopic scale. QM prescrip- 
tions and predictions often go against (classically-trained) intuition. Nevertheless, 
we can state that, by a long and thorough process of confirmed predictions of QM in 
experiments, fueled by the imaginative set-up of paradoxes, QM has been validated
as a correct description of nature. It is fair to say that the overwhelming majority of
physicists have developed a strong trust in the validity of QM. That is, if someone
comes up with a new test based on a new paradox, most physicists would bet that 
QM will come up with the right answer with a very high probability. It is thus by 
the on-going testing and the compatibility of the prediction of QM with the obser- 
vations that QM has been validated. As a consequence, one can use it with strong 
confidence to make predictions in novel directions. This is ideally the situation one 
would like to attain for the problem of validation of all models, those discussed in 
the following section in particular. We now give a very partial list of selected tests 
that established the trust of physicists in QM. 

1. Pauli's exclusion principle states that no two identical fermions (particles with
half-integer values of spin) may occupy the same quantum state simultaneously
[77]. It is one of the most important principles in quantum physics, primarily
because the three types of particle from which ordinary matter is made, electrons,
protons, and neutrons, are all subject to it. With $c_{\rm novel} = 100$ and perfect
agreement in numerous experiments ($p/q = \infty$), this leads to $F^{(1)} = 2.9$.

2. The EPR paradox [78] was a thought experiment designed to prove that quantum
mechanics was hopelessly flawed: according to QM, a measurement performed
on one part of a quantum system can have an instantaneous effect on the
result of a measurement performed on another part, regardless of the distance
separating the two parts. Bell's theorem [79] showed that quantum mechanics
predicted stronger statistical correlations between entangled particles than the
so-called local realistic theory with hidden variables. The importance of this
prediction requires $c_{\rm novel} = 100$ at a minimum. The QM prediction turned out
to be correct, winning over the hidden-variables theories [80, 81] ($p/q = \infty$),
leading again to $F^{(2)} = 2.9$.

3. The Aharonov-Bohm effect predicts that a magnetic field can influence an electron
that, strictly speaking, is located completely beyond the field's range,
again an impossibility according to non-quantum theories ($c_{\rm novel} = 100$). The
Aharonov-Bohm oscillations were observed in ordinary (i.e., not superconducting)
metallic rings, showing that electrons can maintain quantum mechanical
phase coherence in ordinary materials [82, 83]. This yields $p/q = \infty$ and thus
$F^{(3)} = 2.9$ yet again.

4. The Josephson effect provides a macroscopic incarnation of quantum effects
in which two superconductors are predicted to preserve their long-range order
across an insulating barrier, for instance, leading to rapid alternating currents
when a steady voltage is applied across the superconductors. The novelty of this
effect again warrants $c_{\rm novel} = 100$ and the numerous verifications and applications
(for instance in SQUIDs, Superconducting QUantum Interference Devices)
argue for $p/q = \infty$ and thus $F^{(4)} = 2.9$, as usual.

5. The prediction of possible collapse of a gas of atoms at low temperature into
a single quantum state is known as Bose-Einstein condensation, again so much
against classical intuition ($c_{\rm novel} = 100$). Atoms are indeed bosons (particles
with integer values of spin), which are not subjected to the Pauli exclusion
principle evoked in the above test #1 of QM. The first such Bose-Einstein
condensate was produced using a gas of rubidium atoms cooled to $1.7 \times 10^{-7}$ K
[84] ($p/q = \infty$), leading once more to $F^{(5)} = 2.9$.

6. There have been several attempts to develop a paradox-free nonlinear QM theory,
in the hope of eliminating Schrödinger's cat paradox, among other embarrassments.
The nonlinear QM predictions diverge from those of orthodox
quantum physics, albeit subtly. For instance, if a neutron impinges on two slits,
an interference pattern appears, which should, however, disappear if the
measurement is made far enough away ($c_{\rm novel} = 100$). Experimental tests of the
neutron prediction rejected the nonlinear version in favor of the standard QM
[85] ($p/q = \infty$), leading to $F^{(6)} = 2.9$.

7. In addition, measurements at the National Institute of Standards and Technology
(NIST) in Boulder, CO, on frequency standards have been shown to set
limits of order $10^{-21}$ on the fraction of the energy of the rf transition in $^9$Be$^+$
ions that could be due to nonlinear corrections to quantum mechanics [86]. We
assign $c_{\rm novel} = 10$, with $p/q = 10$, to this result, leading to $F^{(7)} = 2.4$. Although
less than $F^{(1-6)}$, this is still an impressive score.

Combining these seven multipliers by multiplication leads to
$V^{(7)}_{\rm posterior}/V^{(1)}_{\rm prior} \simeq 1400$, which
is of course only a lower limit given the many other validation tests not mentioned
here.
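
The combined gain can be reproduced with a few lines of code. This is a sketch under the assumption that successive multipliers of the form (7) simply multiply; the tuple layout and the short test labels are illustrative, and "perfect agreement" is encoded as an infinite $p/q$.

```python
import math

def multiplier_tanh(p_over_q, c_novel):
    """Multiplier (7) for a single validation step."""
    inv_c = 1.0 / c_novel
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4

# (label, c_novel, p/q) as graded in the list above.
qm_tests = [
    ("Pauli exclusion principle", 100, math.inf),
    ("EPR paradox and Bell's theorem", 100, math.inf),
    ("Aharonov-Bohm effect", 100, math.inf),
    ("Josephson effect", 100, math.inf),
    ("Bose-Einstein condensation", 100, math.inf),
    ("nonlinear QM, neutron interference", 100, math.inf),
    ("nonlinear QM, frequency standards", 10, 10.0),
]

multipliers = [multiplier_tanh(p_over_q, c) for _, c, p_over_q in qm_tests]
gain = math.prod(multipliers)  # requires Python 3.8 or later

for (label, _, _), f in zip(qm_tests, multipliers):
    print(f"{label:<36} F = {f:.2f}")
print(f"combined gain V_posterior / V_prior = {gain:.0f}")
```

Evaluated without intermediate rounding, the product comes out somewhat above the figure obtained from the two-digit multipliers quoted in the list, but of the same order.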

Tests of QM are ongoing [87]. But given the presumably huge amount of trust
physicists have in QM which we tried to quantify, why do physicists still feel the 
need to put QM to the "validation" test? This raises the question whether we can 
ever establish a sense of sufficiency for validation. Our position is that this reflects a 
quixotic quest for absolute truth — and also a taste for surprises — that most scientists 
can relate to. Perhaps, by continuing to test QM, a new insight or an anomaly will 
be uncovered which may help progress in the understanding of reality. 

6 Three Examples Drawn from the Authors' Research 
Interests 

6.1 The Olami-Feder-Christensen (OFC) Sand-Pile Model of 
Earthquakes 

This is perhaps the simplest sand-pile model of self-organized criticality, which
exhibits a phenomenology resembling real seismicity [88]. Figure 3 shows a "stress"
map generated by the OFC model immediately after a large avalanche (main shock) 
at two magnifications, to illustrate the rich organization of almost synchronized re- 
gions [Sn]. To validate the OFC model, we examine the properties and prediction 
of the model that can be compared with real seismicity, together with our
assessment of their $c_{\rm novel}$ and quality-of-fit. We are careful to state these properties in
the order required by the sequential validation procedure specified above.

1. The statistical physics community recognized the discovery of the OFC model 
as an important step in the development of a theory of earthquakes: without 
a conservation law (which was thought before to be an essential condition), 
it nevertheless exhibits a power law distribution of avalanche sizes resembling 
the Gutenberg-Richter law [SS]. On the other hand, many other models with 
different mechanisms can explain observed power law distributions [91] . We thus 
attribute only $c_{\rm novel} = 10$ to this evidence. Because the power law distribution
obtained by the model is of excellent quality for a certain parameter value
($\alpha \approx 0.2$), we formally take $p/q = \infty$ (perfect fit). Expression (7) then gives
$F^{(1)} = 2.4$.



Fig. 3. Map of the "stress" field generated by the OFC model immediately after
a large avalanche (main shock) at two magnifications. The upper panel shows the
whole grid of size 1024 and the lower plot represents a subset of the grid delineated
by the square in the upper plot. Adapted from Ref. [90].



2. Prediction of the OFC model concerning foreshocks and aftershocks, and their
exponents for the inverse and direct Omori laws. These predictions are twofold
[90]: (i) the finding of foreshocks and aftershocks with similar qualitative
properties, and (ii) their inverse and direct Omori rates. The first aspect deserves
a large $c_{\rm novel} = 100$ as the observation of foreshocks and aftershocks came as
a rather big surprise in such sand-pile models [92]. The clustering in time and
space of the foreshocks and aftershocks is qualitatively similar to real seismicity
[90], which warrants $p/q = 10$, and thus $F^{(2a)} = 2.9$. The second aspect
is secondary compared with the first one ($c_{\rm novel} = 1$). Since the exponents are
only qualitatively reproduced (but with no formal likelihood test available), we
take $p/q = 0.1$. This leads to $F^{(2b)} = 0.47$.

3. Scaling of the number of aftershocks with the main shock size (productivity
law) [90]: $c_{\rm novel} = 10$ as this observation is rather new but not completely
independent of the Omori law. The fit is good so we grant a grade $p/q = 10$
leading to $F^{(3)} = 2.4$.

4. Power law increase of the number of foreshocks with the main shock size [90]:
this is not observed in real seismicity, probably because this property is absent
or perhaps due to a lack of quality data. This test is therefore not very selective
($c_{\rm novel} = 1$) and the large uncertainties suggest a grade $p/q = 1$ (to reflect the
different viewpoints on the absence of effect in real data) leading to $F^{(4)} = 1$
(neutral test).

5. Most aftershocks are found to nucleate at "asperities" located on the main
shock rupture plane or on the boundary of the avalanche, in agreement with
observations [90]: $c_{\rm novel} = 10$ and $p/q = 10$ leading to $F^{(5)} = 2.4$.

6. Earthquakes cluster on spatially localized geometrical structures known as
faults. This property is arguably central to the physics of seismicity ($c_{\rm novel} = 100$),
but absolutely not reproduced by the OFC model ($p/q = 0.1$). This leads
to $F^{(6)} = 4 \times 10^{-4}$.

Combining the multipliers up to test #5 leads to
$V^{(5)}_{\rm posterior}/V^{(1)}_{\rm prior} \simeq 18.8$,
suggesting that the OFC model is validated as a useful model of the statistical
properties of seismic catalogs, at least with respect to the properties which have
been examined in these first five tests. The crucial last test then fails the model
decisively, since $V^{(6)}_{\rm posterior}/V^{(1)}_{\rm prior} \simeq 7.5 \times 10^{-3}$.
The model cannot be used as a realistic
predictor of seismicity. The results of our quantitative validation process indicate
that it can nevertheless be useful to illustrate certain statistical properties and to
help formulate new questions and hypotheses.
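
The way a single critical failure wipes out the accumulated trust is easiest to see from the running product. The sketch below assumes that successive multipliers of the form (7) simply multiply, counts the two aspects of test #2 as separate steps, and uses illustrative labels; the list `quoted` repeats the two-digit multipliers given in the text.

```python
import math
from itertools import accumulate
from operator import mul

def multiplier_tanh(p_over_q, c_novel):
    """Multiplier (7) for a single validation step."""
    inv_c = 1.0 / c_novel
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4

# (label, c_novel, p/q) as graded in the list above.
ofc_tests = [
    ("1  power-law avalanche sizes", 10, math.inf),
    ("2a foreshock/aftershock clustering", 100, 10.0),
    ("2b Omori exponents", 1, 0.1),
    ("3  productivity law", 10, 10.0),
    ("4  foreshock number scaling", 1, 1.0),
    ("5  aftershocks on asperities", 10, 10.0),
    ("6  localization on faults", 100, 0.1),
]

multipliers = [multiplier_tanh(p_over_q, c) for _, c, p_over_q in ofc_tests]
running_trust = list(accumulate(multipliers, mul))

for (label, _, _), f, v in zip(ofc_tests, multipliers, running_trust):
    print(f"{label:<38} F = {f:8.2e}   cumulative gain = {v:8.2e}")

# The same chain using the two-digit multipliers quoted in the text.
quoted = [2.4, 2.9, 0.47, 2.4, 1.0, 2.4, 4e-4]
quoted_trust = list(accumulate(quoted, mul))
print(f"quoted values: {quoted_trust[-2]:.1f} after test 5, {quoted_trust[-1]:.1e} after test 6")
```

With the quoted two-digit multipliers the chain reproduces 18.8 and 7.5 × 10^-3; evaluating (7) without rounding gives roughly 20 and 8 × 10^-3, which does not change the conclusion.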

6.2 An Anomalous Diffusion Model for Solar Radiation in Cloudy 
Atmospheres 

To improve our modeling skill for climate dynamics, it is essential to reduce the
significant uncertainty associated with clouds. In particular, estimation of the radiation
budget in the presence of clouds needs improvement since current operational
models for the most part ignore all variability below the scale of the climate model's grid
(~100 km). A considerable effort has therefore been expended to derive more realistic
mean-field radiative transfer models [93], mostly by considering only the one-point
variability of clouds, that is, irrespective of their actual structure as captured by
2-point (or higher) correlation statistics. However, it has been widely recognized
that the Earth's cloudiness is fractal over a wide range of scales [94]. This is the
motivation for modeling the paths of solar photons at non-absorbing wavelengths
in the cloudy atmosphere as Lévy walks [95], which are characterized by frequent
small steps (inside clouds) and occasional large jumps (typically between clouds) as
represented schematically in Fig. 4. These (on-average downward) paths start at the
top of the highest clouds and end in escape to space or in absorption at the surface,
respectively cooling and warming the climate system. In contrast with most other
mean-field models for solar radiative transfer, this diffusion model with anomalous
scaling can be subjected to a battery of observational tests.

1. The original goal of this phenomenological model, which accounts for the
clustering of cloud water droplets into broken and/or multi-layered cloudiness, was
to predict the increase in steady-state flux transmitted to the surface compared
to what would filter through a fixed amount of condensed water in a single
unbroken cloud layer [96]. This property is common to all mean-field photon
transport models that do anything at all about unresolved variability [53]. Thus,
we assign only $c_{\rm novel} = 1$ to this test and, given that all models in this class are
successful, we have to take $p/q = 1$, hence $F^{(1)} = 1$. The outcome of this first
test is neutral.

Fig. 4. Schematic representation of the anomalous diffusion model of solar photon
transport at non-absorbing wavelengths in the cloudy atmosphere. In this model,
solar beams follow convoluted Lévy walks, which are characterized by frequent small
steps (inside clouds) and occasional large jumps (between clouds or between clouds
and the surface). The partition between small and large jumps is controlled by the
Lévy index $\alpha$ (the PDF of the jump sizes $\ell$ has a tail decaying as a power law
$\sim 1/\ell^{1+\alpha}$). Reproduced from Ref. [55].



2. The first real test for this model occurred in the late 1990s, when it became
possible to accurately estimate the mean total path cumulated by solar radiation that
reaches the surface. This breakthrough was enabled by access to spectroscopy
at medium (high) resolution of oxygen bands (lines) [97, 98]. There was already
remote sensing technology to infer simultaneously cloud optical depth, which is
column-integrated water in g (or cm$^3$) per cm$^2$ multiplied by the average
cross-section for scattering or absorption in cm$^2$ per g (or cm$^3$). The observed trends
between mean path and optical depth were explained only by the new model
in spite of relatively large instrumental error bars. So we assign $c_{\rm novel} = 100$ to
this highly discriminating test and $p/q = 10$ (even though other models were
generally not in a position to compete), hence $F^{(2)} = 2.9$.

3. Another test was proposed using time-dependent photon transport with a source
near the surface (cloud-to-ground lightning) and a detector in space (aboard the
US DOE FORTE satellite) [99]. The quantity of interest is the observed delay
of the light pulse (due to multiple scattering in the cloud system) with respect
to the radio-frequency pulse (which travels in a straight line). There was no
simultaneous estimate of cloud optical depth, so assumptions had to be made
(informed by the fact that storm clouds are at once thick and dense). Because
of this lack of an independent measurement, we assign only $c_{\rm novel} = 10$ to the
observation and $p/q = 1$ to the model performance. Indeed, this test is arguably
only about the finite horizontal extent of the rain clouds resulting from deep
convection: one can exclude only the most simplistic cloud models based on uniform
plane-parallel slabs. So, again we obtain $F^{(3)} = 1$ for an interesting but presently
neutral test that needs refinement.

4. Min et al. [100] developed an oxygen-line spectrometer with sufficient resolution
to estimate not just the mean path but also its root-mean-square (RMS) value.
They found the prediction by Davis and Marshak [101] for normal diffusion to
be an extreme (envelope) case for the empirical scatter plot of mean vs. RMS
path, and this is indicative that the anomalous diffusion model will cover the
bulk of the data. Because of some overlap with item #2, we assign $c_{\rm novel} = 10$ to
the test and $p/q = 10$ for the model performance since the anomalous diffusion
model had not yet made a prediction for the RMS path (although we note that
other models have yet to make one for the mean path). We therefore obtain
$F^{(4)} = 2.4$.

5. Using similar data but a different normalization than Min et al., more amenable
to model testing, Scholl et al. [102] observed that the RMS-to-mean ratio for
solar photon path is essentially constant whether the cloud structure (according
to mm-wave radar profiles) is complex or not (respectively, diffusion is anomalous
or normal). This is a remarkable empirical finding to which we assign
$c_{\rm novel} = 100$. The new mean- and RMS-path data was explained by Scholl et
al. by creating an ad hoc hybrid between normal diffusion theory (which indeed
has a prediction for the RMS path [101]) and its anomalous counterpart
(which still has none). This modification of the basic model can be viewed as
significant, meaning that we are in principle back to validation step 1 with the
new model. However, this exercise uncovered something quite telling about the
original anomalous diffusion model, namely, that its simple asymptotic (large
optical depth) form used in all the above tests is not generally valid: for typical
cloud covers, the pre-asymptotic terms computed explicitly for the normal
diffusion case prove to be important irrespective of whether the diffusion is normal
or not. Consequently, in its original form (resulting in a simple scaling law for
the mean path with respect to cloud thickness and optical depth), the anomalous
diffusion model fails to reproduce the new data even for the mean path.
(As a result, previous fits yielded only "effective" anomaly parameters and
were misleading if taken too literally.) So we assign $p/q = 0.1$ at best for the
original model, hence $F^{(5)} = 4 \times 10^{-4}$.

Thus, $V^{(5)}_{\rm posterior}/V^{(1)}_{\rm prior} = 3 \times 10^{-3}$, a fatal blow for the anomalous diffusion model in its
simple asymptotic form, even though $V^{(4)}_{\rm posterior}/V^{(1)}_{\rm prior} = 7.0$, which would have been
interpreted as close to a convincing validation.
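
As a check on the arithmetic, and assuming as in the previous examples that the multipliers of successive tests simply multiply, the two-digit values quoted in the list combine as follows (the superscripts label the tests, which is a notational assumption):

```latex
\frac{V^{(4)}_{\rm posterior}}{V^{(1)}_{\rm prior}}
  = F^{(1)} F^{(2)} F^{(3)} F^{(4)}
  = 1 \times 2.9 \times 1 \times 2.4 \approx 7.0,
\qquad
\frac{V^{(5)}_{\rm posterior}}{V^{(1)}_{\rm prior}}
  = 7.0 \times F^{(5)}
  = 7.0 \times 4 \times 10^{-4} \approx 3 \times 10^{-3}.
```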

This is not the end of the story, of course. The original model has already
spawned Scholl et al.'s empirical hybrid and a formalism based on integral (in fact,
pseudo-differential) operators has been proposed [103] that extends the anomalous
diffusion model to pre-asymptotic regimes. More recently, a model for anomalous
transport (i.e., where angular details matter) has been proposed that fits all of the
new oxygen spectroscopy results [95].

In summary, the first and simplest incarnation of the anomalous diffusion model
for solar photon transport ran its course and demonstrated the power of oxygen-line
spectroscopy as a test for the performance of radiative transfer models required in
climate modeling for large-scale average responses to solar illumination. Eventually,
new and interesting tests will become feasible when we obtain dedicated oxygen-line
spectroscopy from space with NASA's Orbiting Carbon Observatory (OCO) mission
planned for launch in 2008. Indeed, we already know that the asymptotic scaling for
reflected photon paths [104] is different from their transmitted counterparts [101] in
standard diffusion theory for both mean and RMS.

6.3 A Computational Fluid Dynamics Model for Shock-Induced 
Mixing 

So far, our examples of models for complex phenomena have hailed from quantum
and statistical physics. In the latter case, they are stochastic models composed of:
(1) simple code (hence rather trivial verification procedures) to generate realizations,
and (2) analytical expressions for the ensemble-average properties (that are used in
the above validation exercises). We now turn to gas dynamics codes, which have a
broad range of applications, from astrophysical and geophysical flow simulation to
the design and performance analysis of engineering systems. Specifically, we discuss
the validation of the "Cuervo" code developed at Los Alamos National Laboratory
[105, 106] for use as a simulation tool in the complex physics of compressible mixing.
This software generates solutions of the Euler equations for flows of inviscid,
non-heat-conducting, compressible gas. Cuervo has been verified against a suite of
test problems including, e.g., those discussed by Liska and Wendroff [107]. As clearly
stated by Oberkampf and Trucano [33], however, such verification differs from and
does not guarantee validation against experimental data. A standard validation
scenario involves the Richtmyer-Meshkov (RM) instability [108, 109], which arises when
a density gradient in a fluid is subjected to an impulsive acceleration, e.g., due to
passage of a shock wave (see Fig. 5). Evolution of the RM instability is nonlinear and
hydrodynamically complex and hence defines an excellent problem-space to assess
CFD code performance for more general mixing scenarios.



Fig. 5. Schematic of the interactions between weakly shocked (Mach number «1.2) 
light gas (air) and a column of dense gas (SFg). The Richtmyer-Meshkov instability 
occurs from the mismatch between the pressure gradient (at the shock front) and 
the density gradient (between the light and dense gases), which acts as a source of 
baroclinic vorticity. The column of dense gas "rolls up" into a double-spiral form 
under the action of the evolving vorticity. 



[Fig. 5 panel labels: Pre-shock; Post-shock]




In the series of shock-tube experiments described in [110], RM dynamics are realized
by preparing one or more cylinders with approximately identical axisymmetric
Gaussian concentration profiles of dense sulfur hexafluoride (SF6) in air. This (or
these) vertical "gas cylinder(s)" is (are) subjected to a weak shock (Mach number
≈1.2) propagating horizontally, i.e., perpendicular to the axis of the gas cylinders.
The ensuing dynamics are largely governed by the mismatch of the density gradient
between the gases (with the density of SF6 approximately five times that of
air) and the pressure gradient through the shock wave; this mismatch acts as the
source for baroclinic vorticity generation. Moreover, the flow evolution is strongly
two-dimensional up to the final times considered. Visualization of the density field is
obtained using a planar laser-induced fluorescence (PLIF) technique, which provides
high-resolution quantitative concentration measurements in a plane that cross-cuts
the cylinders. The velocity field is diagnosed using particle image velocimetry (PIV),
based on correlation measurements of small-scale particles that are seeded in the initial
flow field. Careful post-processing of images from 130 μs to 1000 μs after shock
passage yields planar concentration and velocity with error bars.

1. This RM flow is dominated at early times by a vortex pair. Later, secondary
instabilities rapidly transition the flow to a mixed state. We rate $c_{\rm novel} \approx 10$ for
the observations of these two instabilities. The Cuervo code correctly captures
these two instabilities, best observed and modeled with a single cylinder. At this
qualitative level, we rate $p/q = 10$ (good fit), which leads to $F^{(1)} = 2.4$.

2. Older data for two-cylinder experiments acquired with a fog-based technique
(rather than PLIF) showed two separated spirals associated with the primary
instability, but the Cuervo code predicted the existence of a material bridge between
those structures. This previously unobserved connection was subsequently
diagnosed experimentally with the improved observational technique, i.e., the
simulation code was truly predictive of this phenomenon. Using $c_{\rm novel} \approx 10$ and
$p/q = 10$ yields $F^{(2)} = 2.4$.

3. The evolution of the total power as a function of time offers another useful metric.
The numerical simulation quantitatively accounts for the exponential growth
of the power with time, within the experimental error bars. Using $c_{\rm novel} = 10$
and $p/q = 10$ yields $F^{(3)} = 2.4$.

4. The concentration power spectrum as a function of wavenumber for different
times provides another way (in the Fourier domain) to present the information on
the hierarchy of structures already visualized in physical space ($c_{\rm novel} = 1$). The
Cuervo code correctly accounts for the low wavenumber part of the spectrum but
underestimates the high wavenumber part (beyond the deterministic-stochastic
transition wavenumber) by a factor of 2 to 5. We capture this by setting $p/q = 0.1$,
which yields $F^{(4)} = 0.47$.

Combining the multipliers according to Eq. (3) leads to $V_{\rm posterior}/V_{\rm prior} = 6.5$, a significant
gain, but still not sufficient to compellingly validate the Cuervo code for inviscid
shock-induced hydrodynamic instability simulations, at least in 2D. Clearly, validation
against this single set of experiments is inadequate to address all intended uses
of a CFD code such as Cuervo.
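The arithmetic behind this trust gain can be sketched in a few lines of Python. The sketch
assumes that the multiplier of Eq. (7), which is not restated in this section, has the form
$F = [\tanh(p/q + 1/c_{\rm novel}) / \tanh(1 + 1/c_{\rm novel})]^4$; this assumed form reproduces
the multipliers quoted above to within rounding. The function and variable names are
hypothetical and serve only as illustration.

```python
import math

def trust_multiplier(p_over_q, c_novel):
    """Assumed form of Eq. (7): [tanh(p/q + 1/c) / tanh(1 + 1/c)]**4."""
    return (math.tanh(p_over_q + 1.0 / c_novel)
            / math.tanh(1.0 + 1.0 / c_novel)) ** 4

# (p/q, c_novel) ratings of the four Cuervo validation steps listed above
cuervo_steps = [(10, 10), (10, 10), (10, 10), (0.1, 1)]
multipliers = [trust_multiplier(pq, c) for pq, c in cuervo_steps]

# Eq. (3): the overall trust gain is the product of the step multipliers
gain_exact = math.prod(multipliers)

# Same product, using the two-digit multipliers quoted in the text
quoted = [2.4, 2.4, 2.4, 0.47]
gain_quoted = math.prod(quoted)

for n, f in enumerate(multipliers, start=1):
    print(f"F({n}) = {f:.3f}")
print(f"trust gain, unrounded multipliers: {gain_exact:.2f}")
print(f"trust gain, quoted multipliers:    {gain_quoted:.2f}")
```

With unrounded multipliers the product comes out slightly higher than with the two-digit
values quoted in the text; the difference is a rounding effect and does not change the
qualitative conclusion.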

Intricate experiments with three gas cylinders have since been performed [111],
and others are currently under way to further challenge compressible flow codes.

6.4 Discussion

The above three examples illustrate the utility of representing the validation process
as a succession of steps, each of them characterized by the two parameters $c_{\rm novel}$ and
$p/q$. The determination of $c_{\rm novel}$ requires expert judgment and that of $p/q$ a careful
statistical analysis, which is beyond the scope of the present report (see Ref. [76]
for a detailed case study). The parameter $q$ is ideally imposed as a confidence level,
say 95% or 99% as in standard statistical tests. In practice, it may depend on the
experimental test and requires a case-by-case examination.

The uncertainties of $c_{\rm novel}$ and of $p/q$ need to be assessed. Indeed, different
statistical estimations or metrics may yield different values of $p/q$, and different experts
will likely rate the novelty $c_{\rm novel}$ of a new test differently. As a result, the trust gain
$V_{\rm posterior}/V_{\rm prior}$ after $n$ tests necessarily has a range of possible values that grows
geometrically with $n$. In certain cases, a drastic difference can be obtained by a
change of $c_{\rm novel}$. For instance, if instead of attributing $c_{\rm novel} = 100$ to the sixth OFC
test, we put $c_{\rm novel} = 10$ (resp. 1) while keeping $p/q = 0.1$, $F^{(6)}$ is changed from
$4 \cdot 10^{-4}$ to $4 \cdot 10^{-3}$ (resp. 0.47). The trust gain then becomes
$V_{\rm posterior}/V_{\rm prior} = 0.07$ (resp. $\approx 9$). For the sixth OFC test, $c_{\rm novel} = 1$ is
arguably unrealistic, given the importance of faults in seismology. The two possible
choices $c_{\rm novel} = 100$ and $c_{\rm novel} = 10$ then give similar conclusions on the
invalidation of the OFC model. In our examples, $V_{\rm posterior}/V_{\rm prior}$ provides a
qualitatively robust measure of the gain in trust after $n$ steps; this robustness has been
built in by imposing a coarse-grained quality on $p/q$ and $c_{\rm novel}$.
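The sensitivity to the novelty rating discussed above can be explored with a short Python
sketch. It assumes the same hypothetical form for the multiplier of Eq. (7) as the numbers
quoted in this section suggest, namely
$F = [\tanh(p/q + 1/c_{\rm novel}) / \tanh(1 + 1/c_{\rm novel})]^4$, and holds all other
multipliers fixed, so that the overall trust gain is simply rescaled by the ratio of the
new to the old value of the multiplier. The names used are illustrative only.

```python
import math

def trust_multiplier(p_over_q, c_novel):
    # Assumed form of Eq. (7); not restated in this section of the text.
    return (math.tanh(p_over_q + 1.0 / c_novel)
            / math.tanh(1.0 + 1.0 / c_novel)) ** 4

p_over_q = 0.1                               # poor fit, as rated for the sixth OFC test
baseline = trust_multiplier(p_over_q, 100)   # original rating c_novel = 100

for c_novel in (100, 10, 1):
    f6 = trust_multiplier(p_over_q, c_novel)
    # Factor by which the overall trust gain is rescaled relative to the
    # baseline rating, all other multipliers being unchanged
    print(f"c_novel = {c_novel:>3}: F = {f6:.2e}, "
          f"trust gain rescaled by {f6 / baseline:.1f}")
```

A poor fit is penalized much more severely when the test is rated as highly novel, which
is why a coarse-grained, order-of-magnitude rating of $c_{\rm novel}$ is a reasonable
compromise.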



7 Summary 

The validation of numerical simulations continues to become more important as 
computational power grows, as the complexity of modeled systems increases, and 
as increasingly important decisions are influenced by computational models. We 
have proposed an iterative, constructive approach to validation using quantitative 
measures and expert knowledge to assess the relative state of validation of a model 
instantiated in a computer code. In this approach, the increase/decrease in validation 
is mediated through a function that incorporates the results of the model vis-a-vis 
the experiment together with a measure of the impact of that experiment on the 
validation process. While this function is not uniquely specified, it is not arbitrary: 
certain asymptotic trends, consistent with heuristically plausible behavior, must be 
observed. In four fundamentally different examples, we have illustrated how this 
approach might apply to a validation process for physics or engineering models. 
We believe that the multiplicative decomposition of trust gains or losses (given in
Eq. (3)), using a suitable functional prescription (such as Eq. (7)), provides a reasoned
and principled description of the key elements — and fundamental limitations — of 
validation. It should be equally applicable to biological and social sciences, especially 
since it is built upon the decision-making processes of the latter. We believe that 
our procedure transforms the paralyzing criticisms in Popper's style that "we cannot 
validate, we can only invalidate" [19] into a practical constructive algorithm. This 
strategy addresses specifically both problems of distinguishing between competing 
models and transforming the vicious circle of lack of suitable data into a virtuous 
spiral path: each cycle is marked by a quantified increment of the evolving trust we 
put in a model based on the novelty and relevance of new data and the quality of 
fits. 

We have also surveyed and commented extensively on the V&V literature. We 
hope this digest will help the reader as much as its collation helped us deepen our
understanding of the challenge of model validation, including a new perspective on
some of our own work. We close with these far-reaching thoughts by Patrick J. 
Roache [112]:

In an age of spreading pseudoscience and anti-rationalism, it behooves those 
of us who believe in the good of science and engineering to be above re- 
proach whenever possible. Public confidence is further eroded with every er- 
ror we make. Although many of society's problems can be solved with a 
simple change of values, major issues such as radioactive waste disposal 
and environmental modeling require technological solutions that necessarily
involve computational physics. As Robert Laughlin [113] noted in this
magazine, "there is a serious danger of this power [of simulations] being 
misused, either by accident or through deliberate deception." Our intellec- 
tual and moral traditions will be served well by conscientious attention to 
verification of codes, verification of calculations, and validation, including 
the attention given to building new codes or modifying existing codes with 
specific features that enable these activities.

Appendix:

A More Formal Look at the Role of Validation in the 
Modeling Enterprise 

We deal with models that possess two aspects: a conceptual part based on the
physical laws of nature (such as the Navier-Stokes conservation equations for fluid
dynamics) and a computational part (as in CFD). Mathematically, a model and
the associated observations are defined formally, as described in section A.1 below:

• The model $\mathcal{M}$ maps the set $\{A\}$ of parameters and of initial and boundary
conditions to a forecast of state variables in a formal vector $X_f$;

• An observation projection $\mathcal{G}$ maps the true dynamics or physics in $X_t$ to raw
measurements $y_o$.

Such definitions may seem abstract and of little use, but they are important foundations
on which to build a comprehensive roadmap for physically-based model validation.

In the following section, we refine the above definitions and introduce a few more
operators and quantities. In section A.2 we revisit the key steps in a validation
loop with this notation in hand. Finally, we discuss some fundamental limitations
on model validation in section A.3, using some of our own research in time-series
analysis for illustration.

A.1 Definitions

Let us denote by $X_t(r,t)$ the true physical field. Observations $y_o(r,t)$ are obtained via
a possibly nonlinear operator $\mathcal{G}$ acting on $X_t(r,t)$:

    $y_o(r,t) = \mathcal{G}\{X_t(r',t')\}$ .    (A.1)

The observations at position $r$ and time $t$ may be a combination of past values
obtained over some finite region, hence our use of $(r',t')$, which are different from
$(r,t)$. The operator $\mathcal{G}$ may thus be non-local and (causally) time-dependent. In
addition, any measurement has noise and uncertainties. Therefore, $\mathcal{G}$ is a stochastic
operator. The simplest specification beyond ignoring noise is to consider an additive
noise.

A model $\mathcal{M}$ provides a forecast $X_f(r,t)$ either in the actual future or in terms
of what will lead (via another operator) to the value of the measurements beyond a
certain fiducial point in time. This is expressed by

    $X_f(r,t) = \mathcal{M}(\{A\})$ .    (A.2)

$\mathcal{M}$ is the model operator, which contains for instance the equation of state, the
formulation in terms of ODEs, PDEs, discrete maps and so on, which are supposed to
embody the known physics of the underlying processes. $\{A\}$ contains the parameters
of the model as well as the boundary and initial conditions. The model operator
$\mathcal{M}$ has a non-random part. It can also contain an additive or multiplicative noise
component to represent the forecast errors as well as possible intrinsic stochastic
components of the dynamics. The forecast errors may stem from computational
errors, numerical instabilities and uncertainties, the existence of multiple branches
in the solution and so on. The simplest specification is again to consider an additive
noise.

The output $\mathcal{M}(\{A\})$ of the model is translated into physical quantities that can
be compared with the observation via another operator $\mathcal{H}$, which models, mathematically
and in code, the observation process. In general, one would like to compare
$y_o(r,t)$ given by (A.1) with $\mathcal{H}[X_f(r,t)]$, that is, $\mathcal{G}\{X_t(r',t')\}$ with
$\mathcal{H}[\mathcal{M}(\{A\})]$. The
intended use of the model is key to "objective model validation," because it turns
"subjectiveness" of the model validation into an "object" using hypothesis testing
and decision theory. To implement this idea, it is natural to introduce a cost function
(see below) for the intended use of the model:

    $C\left(\mathcal{G}\{X_t(r',t')\};\, \mathcal{H}[\mathcal{M}(\{A\})]\right)$ ,

which is a measure of how well the model accounts for the observations. In this
expression, the cost function is evaluated in the "physical space" of observations/measurements.
An alternative is to evaluate the cost function in the "model
space," i.e.,

    $C\left(\mathcal{G}^{-1}\{\mathcal{H}\{X_t(r',t')\}\};\, \mathcal{M}(\{A\})\right)$ ,

where $\mathcal{G}^{-1}$ is the formal inverse operator to $\mathcal{G}$, which maps observations $y_o$ onto the
model space $X_f$. In data assimilation, an explicit form of $\mathcal{G}^{-1}$ does not exist in general
due to rank deficiency. However, such an alternative representation within the linear
theory corresponds to the duality between Kalman filtering and 3D-Var [114].

We propose to define the validation problem as a decision problem in which 
one uses the loss function to infer/decide how much confidence one feels in the 
reliability of the model to function in the range in which it is supposed to apply. 
The interesting and challenging situation occurs when this range extends beyond 
the region of parameter space in which all reasonably stringent controls have been 
performed. Validation requires the build-up of trust in the model or code so that 
it is believed to be resilient and to work in complex real situations combining the 
simple regimes that have been tested. The cost function is just an alternative way 
of constructing the statistical test that provides the probability level p defined in 
the main text. 
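The roles of the operators defined in this section can be illustrated with a toy Python
sketch. Everything in it is hypothetical: the "true" field, the two-parameter model, the
window-averaging instrument response, the noise level and the quadratic cost are chosen
only to show how $\mathcal{G}$, $\mathcal{M}$, $\mathcal{H}$ and $C$ fit together in
observation space.

```python
import math
import random

def true_field(r):
    """Hypothetical 'true' field X_t(r); unknown in a real application."""
    return math.sin(2 * math.pi * r) + 0.1 * math.sin(6 * math.pi * r)

def model(params, r):
    """Model operator M({A}): forecast X_f(r) for parameters {A}."""
    amplitude, phase = params
    return amplitude * math.sin(2 * math.pi * r + phase)

def window_average(field, r, half_width=0.05, n=11):
    """Non-local instrument response: average of the field around r."""
    pts = [r + half_width * (2 * k / (n - 1) - 1) for k in range(n)]
    return sum(field(x) for x in pts) / n

def observe_truth(r, rng, noise_std=0.02):
    """Stochastic operator G: smoothing plus additive measurement noise."""
    return window_average(true_field, r) + rng.gauss(0.0, noise_std)

def observe_model(params, r):
    """Operator H: the same smoothing applied to the model forecast."""
    return window_average(lambda x: model(params, x), r)

def cost(params, stations, y_obs):
    """Quadratic cost C(G{X_t}; H[M({A})]) evaluated in observation space."""
    return sum((y - observe_model(params, r)) ** 2
               for r, y in zip(stations, y_obs)) / len(stations)

rng = random.Random(0)                       # fixed seed for reproducibility
stations = [k / 20 for k in range(20)]       # measurement locations
y_obs = [observe_truth(r, rng) for r in stations]

good = cost((1.0, 0.0), stations, y_obs)     # well-tuned parameters
poor = cost((1.0, 1.0), stations, y_obs)     # phase error of one radian
print(f"cost for the well-tuned model:    {good:.4f}")
print(f"cost for the phase-shifted model: {poor:.4f}")
```

Even the well-tuned model retains a nonzero cost here, because the toy model lacks the
small-scale component of the true field and the observations are noisy.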

A.2 Four Recurring Types of Problem in Physically-Based Model
Validation 

Our overarching goal is to advocate approaches to validation that are grounded in 
physics. The term "physics-based" embodies two strategies: 

(a) use physical reasoning to improve modeling, target experiments and loss functions,
and detect missed "dimensions;"

(b) use concepts from statistical physics to formulate (in the spirit of Brown and
Sethna [115]) a validation process of complex models with complex data in the
form of an N-body problem.

Following this roadmap, we find ourselves asking the same four questions again and 
again: 

1. How to model? (the question of model construction) 

2. What to measure? (the question of estimating $c_{\rm novel}$ in the main text)

3. How to measure it? (the question of choosing and estimating the cost function 
or "metric") 

4. How to interpret the results? (the question of estimating p in the main text) 

We view these four defining questions as the crucial steps within the validation loop 
described in sections 2-4 of the main text. 

Problem 1: Targeting model development (How to model?) 

Our discussion so far may give the impression that the modeling step is "homogeneous."
It may actually be advantageous to develop a hierarchical modeling framework.
In this respect, Oden et al. [116] proposed to use hierarchical modeling as a
mathematical structure that can be useful in directing validation studies. In this
construction, a class of models of events of interest is defined in which one identifies
a "fine" model that possesses a level of sophistication high enough to adequately
capture the event of interest with good accuracy. This model may be intractable,
even computationally. Hierarchical modeling consists in identifying a family of coarse
models that are solvable. Using the fine model output as a datum, the error in the
solution of ever coarser models can be estimated and controlled, with the goal of
obtaining a model best suited for the simulation goal at hand. The essential components
of this program are the following [116]:

1. Experimental data are collected to fully characterize the fine model. 

2. Quantities $G(X)$ of interest are specified as the essential physical entity to be
predicted in the simulation (for instance in the form of the probability of the
predicted values of the quantity).

3. The coarsest model is used to extract a preliminary estimate of $G(X)$, and modeling
and approximation errors are computed.

4. If the estimated error exceeds the prescribed tolerance, the model is enhanced
and the calculation is repeated until a model yielding results within the preset
bounds is obtained.

5. The truncation error of the perturbation expansion is estimated: if the total
error exceeds a preset tolerance, the data set and the fine model definition must
be updated; if not, the predicted $G(X)$ and the probability that it will take on
values in a given interval are produced as output.
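The enhancement loop of steps 3 and 4 can be sketched as follows. This is only a toy
illustration under stated assumptions, not the procedure of Ref. [116]: the "simulation"
is a trapezoid-rule integral whose resolution plays the role of model sophistication, the
"fine model" is the same rule at very high resolution, and the function names and
tolerance are hypothetical. The perturbation-expansion error estimate of step 5 is not
represented.

```python
import math

def quantity_of_interest(n_cells):
    """G(X) from a model with n_cells cells: trapezoid estimate of the
    integral of exp(-x**2) over [0, 1] (a stand-in for a simulation)."""
    h = 1.0 / n_cells
    f = lambda x: math.exp(-x * x)
    inner = sum(f(k * h) for k in range(1, n_cells))
    return h * (0.5 * f(0.0) + inner + 0.5 * f(1.0))

def select_model(tolerance, fine_cells=4096, coarsest_cells=2, max_cells=2048):
    """Enhance the coarse model until its error against the fine-model
    datum falls within the prescribed tolerance."""
    datum = quantity_of_interest(fine_cells)      # fine model output as datum
    n_cells = coarsest_cells                      # start from the coarsest model
    while n_cells <= max_cells:
        estimate = quantity_of_interest(n_cells)
        error = abs(estimate - datum)
        if error <= tolerance:
            return n_cells, estimate, error
        n_cells *= 2                              # enhance the model and repeat
    raise RuntimeError("no coarse model meets the tolerance")

n_cells, estimate, error = select_model(tolerance=1e-4)
print(f"selected model: {n_cells} cells, G = {estimate:.6f}, error = {error:.1e}")
```

The loop returns the coarsest member of the family that meets the tolerance, which is the
sense in which the selected model is "best suited for the simulation goal at hand."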

A concrete implementation of this hierarchical modeling program has been performed
by Israeli and Goldenfeld [27]. Using elementary cellular automata as an example, Israeli and
Goldenfeld show how to coarse-grain cellular automata in all categories of Wolfram's
exhaustive classification [117]. The main discovery is that computationally
irreducible physical processes can be predictable and even computationally reducible
at a coarse-grained level of description. The resulting coarse-grained cellular automata
constructed with the coarse-graining procedure emulate the large-scale behavior
of the original systems without accounting for small-scale details. These results remind
us that it is advantageous to develop a view of complex physical processes at
different scales, as the predictability may depend on the scale of observation.

A related approach has been discussed recently by Brown and Sethna [115], who
consider models defined in terms of a set of nonlinear ODEs applied to systems that
have large numbers of poorly known parameters, simplified dynamics, and uncertain
connectivity. They call models possessing these three features "sloppy models."
Sloppy models characterize many other high-dimensional multi-parameter nonlinear
models. Brown and Sethna propose to use the maximum likelihood method to frame
the problem of parameter estimation and model validation in the form of a statistical
ensemble method. In our language, the problem boils down to a study of the cost
function $C$ and its stiff and soft directions determined from the eigenvalue problem
of the Hessian of $C$ (with respect to the parameters of the model). In practice, Brown
and Sethna propose to estimate the Hessian of $C$ in terms of the so-called
"Levenberg-Marquardt" Hessian (thus called because of its use in that popular minimization
algorithm); that quantity is defined simply as a sum of pairwise products of
first-order derivatives of the residuals with respect to the model parameters. Stiff modes
correspond to large eigenvalues. Similar to a decomposition in principal components,
retaining the stiff modes allows one to get a more robust signature of the
coarse-grained properties of the dynamics. This constitutes a concrete implementation of
our Problem 4 below on "targeting model errors." This procedure also addresses
the problem of defining the operator $\mathcal{H}$ that selects the output of the model for
comparison to the experimental data.
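A minimal Python sketch of the stiff/soft analysis follows. It is not taken from
Ref. [115]: the two-exponential toy model, the sampling times and the parameter values
are hypothetical, chosen because two nearly equal decay rates are a simple way to obtain
one well-constrained and one poorly constrained parameter combination. The
Levenberg-Marquardt Hessian is formed as $J^T J$, with $J$ the matrix of first-order
derivatives of the residuals with respect to the parameters.

```python
import math

def lm_hessian(params, times):
    """Levenberg-Marquardt Hessian J^T J for the toy model
    y(t) = exp(-a t) + exp(-b t), with J[i][j] = d r_i / d theta_j."""
    jac = [[-t * math.exp(-theta * t) for theta in params] for t in times]
    n = len(params)
    return [[sum(row[j] * row[k] for row in jac) for k in range(n)]
            for j in range(n)]

def eig_sym_2x2(h):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric 2x2 matrix
    with nonzero off-diagonal element."""
    a, b, d = h[0][0], h[0][1], h[1][1]
    mean, radius = 0.5 * (a + d), math.hypot(0.5 * (a - d), b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # (b, lam - a) solves (H - lam I) v = 0 whenever b != 0
        norm = math.hypot(b, lam - a)
        pairs.append((lam, (b / norm, (lam - a) / norm)))
    return pairs

times = [0.1 * k for k in range(1, 51)]    # hypothetical sampling times
params = (1.0, 1.2)                        # two nearly redundant decay rates

(stiff_val, stiff_vec), (soft_val, soft_vec) = eig_sym_2x2(lm_hessian(params, times))
print(f"stiff eigenvalue {stiff_val:.3e}, direction {stiff_vec}")
print(f"soft  eigenvalue {soft_val:.3e}, direction {soft_vec}")
print(f"stiff/soft eigenvalue ratio: {stiff_val / soft_val:.0f}")
```

In this toy case the stiff direction changes both decay rates in the same sense, while the
soft direction trades one rate against the other, a combination that the data barely
constrain.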

There is an interesting avenue for research here: rather than performing the
principal component decomposition in one step, it may be advantageous to perform
a series of sub-system analyses, or cluster analyses, retaining the stiff modes of each
sub-system and then aggregating them at the next level of the hierarchy.

Problem 2: Targeting the observations (What to measure?) 

Objective: Find $\mathcal{G}$ (and the associated $\mathcal{H}$) that reveals the most about model critical
behavior.

The problem has been addressed specifically in these terms by Palmer et al. [72] 
to target adaptive observations to "sensitive" parts of the atmosphere. Targeting 
observations could be directed by the desire to get access to the most relevant information
that is also the most reliable (e.g., contaminated by the smallest errors).
It may be worth mentioning that targeting the observations depends not only on
$\mathcal{G}$, but also on $\mathcal{M}$, $\{A\}$, as well as $C$ (along with its own parameters discussed below).
The targeting of the observations is the problem of maximizing the coefficient
$c_{\rm novel}$ introduced in the main text so that the new experiment/observation explores
novel dimensions of the parameter and variable spaces of both the process and the 
model that can best reveal potential flaws that could compromise the important 
applications. In general, one targets observations by developing experiments that 
are thought to provide, in some sense, the most relevant tests of the physics. 

Oberkampf and Trucano (2002) [33] suggest that traditional experiments could 
generally be grouped into three categories: 

1. experiments that are conducted primarily for the purpose of improving the 
fundamental understanding of some physical process; 

2. experiments conducted primarily for constructing or improving mathematical 
models of fairly well-understood flows; 

3. experiments that determine or improve the reliability, performance, or safety of 
components, subsystems, or complete systems. 

These authors argue that validation experiments constitute a fourth type of experiment:
"A validation experiment is conducted for the primary purpose of determining
the validity, or predictive accuracy, of a computational modeling and simulation capability.
In other words, a validation experiment is designed, executed, and analyzed
for the purpose of quantitatively determining the ability of a mathematical model
and its embodiment in a computer code to simulate a well-characterized physical
process." This leads them to propose the following guidelines:

• Guideline #1: A validation experiment should be jointly designed by experimentalists,
model developers, code developers, and code users working closely
together throughout the program, from inception to documentation, with complete
candor about the strengths and weaknesses of each approach.

• Guideline #2: A validation experiment should be designed to capture the essential
physics of interest, including all relevant physical modeling data and initial
and boundary conditions required by the code.

• Guideline #3: A validation experiment should strive to emphasize the inherent
synergism between computational and experimental approaches.

• Guideline #4: Although the experimental design should be developed cooperatively,
independence must be maintained in obtaining both the computational
and experimental results.

• Guideline #5: A hierarchy of experimental measurements of increasing computational
difficulty and specificity should be made, for example, from globally
integrated quantities to local measurements.

• Guideline #6: The experimental design should be constructed to analyze and estimate
the components of random (precision) and bias (systematic) experimental
errors.

Problem 3: Targeting the cost function (How to estimate the 
penalty on imperfect models and measurements using their 
discrepancies?) 

For given measurements or experiments, that is, for given Q, the problem is to 
find the optimal cost function C for the intended use of the model. The notion of 
optimality needs to be defined. It could capture a compromise between the following 
requirements:

• fit best the important features of the data (what is "important" may be de- 
cided on the basis of previous studies and understanding or other processes, or 
programmatic concerns) ; 

• minimize the extraction of spurious information from noise, which requires one 
to have a precise idea of the statistical properties of the noise (if such knowledge 
is not available, the cost function should take this into account). 

The choice of the cost function involves the choice of how to look at the data. 
For instance, one may want to expand the measurements at multiple scales using 
wavelet decompositions and compare the prediction and observations scale by scale, 
or in terms of multifractal spectra of the physical fields estimated from these wavelet
decompositions or from other methods. The general idea here is that, given complex
observation fields, it is appropriate to "project" the data onto a variety of "metrics" 
designed to detect and characterize phenomena of particular interest. For instance, 
wavelet-based scaling properties can be used in the comparison between observa- 
tions and model predictions; the question is then: How well is the model/code able 
to reproduce the salient multi-scale properties derived from the observations? The 
physics of turbulent fields and of complex systems has offered many such new
tools to unfold complex fields according to different statistics. Each of these statistics
provides a basis for a metric to compare observations with model predictions.
Each such statistic thus leads to a cost function focusing on a particular feature of
the process. These metrics are derived from the understanding that turbulent fields 
can be analyzed using them, revealing strong constraints in their organization (spa- 
tial structure and temporal evolution). These metrics can therefore be described as 
"physics-based." 

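A scale-by-scale comparison of the kind described above can be sketched with a plain Haar
wavelet transform. The sketch is illustrative only: the synthetic "observed" and
"predicted" signals, the noise amplitudes, and the choice of the absolute log-ratio of
detail energies as a per-scale cost are all assumptions made for this example.

```python
import math
import random

def haar_detail_energy(signal):
    """Mean squared Haar detail coefficient at each dyadic scale,
    from the finest scale (index 0) to the coarsest."""
    approx, energies = list(signal), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(d * d for d in details) / len(details))
    return energies

def scale_by_scale_cost(observed, predicted):
    """One cost per scale: |log ratio| of model to observed detail energy."""
    e_obs = haar_detail_energy(observed)
    e_mod = haar_detail_energy(predicted)
    return [abs(math.log(m / o)) for o, m in zip(e_obs, e_mod)]

rng = random.Random(1)                     # fixed seed for reproducibility
n = 256                                    # signal length, a power of two
smooth = [math.sin(2 * math.pi * k / n) for k in range(n)]
# Hypothetical observation: large-scale wave plus small-scale fluctuations
observed = [s + 0.2 * rng.gauss(0, 1) for s in smooth]
# Hypothetical model output: same wave, but small-scale fluctuations too weak
predicted = [s + 0.05 * rng.gauss(0, 1) for s in smooth]

costs = scale_by_scale_cost(observed, predicted)
for level, c in enumerate(costs):
    print(f"detail scale of {2 ** (level + 1)} samples: cost = {c:.2f}")
```

The per-scale costs are large at the finest scales, where the hypothetical model lacks
variability, and small at the coarsest scales, where both signals are dominated by the
same wave. A single global misfit would hide this distinction.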
Furthermore, the choice of the cost function should take into account that the 
diagnostics of the experiments may lead to spurious results [Tl]. For example, in 
laser-driven shock experiments, because the laser-induced fluorescence method illu- 
minates the mixing zone with a planar sheet of light, this diagnostic can lead to 
aliasing of long-wavelength structures into short-wavelength features in the images, 
thus affecting the interpretation of observed small-scale structures in the mixing 
zone. Also, because of the dynamic limits on diagnostic resolution, the formation of 
small-scale structure cannot be completely determined.

As emphasized by Noam Chomsky in his own field of work [118], the danger with
the Popperian strategy [119] is that one might prematurely reject a theory based on
"falsification" using data that are themselves poorly understood. For instance, lack of
quality control for the experiments can result in premature rejection of the model. On
these issues, Stein [120] discusses means for controlling and for understanding sample
selection and variability, which can compromise conclusions drawn from validation
tests.

The problem of the choice of the cost function C seems, however, to be of less 
importance than Problem 2 above and Problem 4 below. In fact, almost all classical 
results on the limit properties of efficiency of statistical inference are valid (and 
proved) for a whole general family of cost functions $C(\cdot\,;\cdot)$ satisfying the following
conditions (see, e.g., Ibragimov and Hasminskii [121]):

(a) $C(x,y) = c(|x - y|)$;

(b) $c(z)$ is a positive monotonically increasing function (including, e.g., power-law
functions $|z|^q$, with $q > 0$);

(c) $c(z)$ should not increase too fast (its mean with respect to the Gaussian distribution
must remain finite).

Thus, statistical limit theorems are proved for the whole class of different power-law
cost functions (including the classic choice $q = 2$).
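Conditions (a)-(c) are easy to check for the power-law family. The Python sketch below
uses the standard closed form for the absolute moments of a zero-mean Gaussian,
$E|Z|^q = \sigma^q\, 2^{q/2}\, \Gamma((q+1)/2)/\sqrt{\pi}$, which is finite for every $q > 0$;
the function names are hypothetical.

```python
import math

def power_law_cost(x, y, q=2.0):
    """C(x, y) = c(|x - y|) with c(z) = |z|**q, q > 0 (conditions a and b)."""
    return abs(x - y) ** q

def gaussian_mean_of_cost(q, sigma=1.0):
    """Closed-form E[|Z|**q] for Z ~ N(0, sigma**2); it is finite for every
    q > 0, which is what condition (c) requires."""
    return sigma ** q * 2 ** (q / 2) * math.gamma((q + 1) / 2) / math.sqrt(math.pi)

for q in (0.5, 1.0, 2.0, 4.0):
    print(f"q = {q}: c(|3 - 1|) = {power_law_cost(3.0, 1.0, q):.3f}, "
          f"E[c(|Z|)] = {gaussian_mean_of_cost(q):.3f}")
```

By contrast, a cost growing as fast as $\exp(z^2)$ would violate condition (c), since its
mean under a unit-variance Gaussian diverges.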

As an example, it may be appropriate to consider the cost function in the following
form. Let us assume we are interested in some functional

    $Z\left(R, T \mid \mathcal{G}\{X_t(r,t)\},\, r \in D(R),\, t < T\right)$

depending on the past true physical field $X_t(r,t)$ in some region $D(R)$. In this case,
the cost function can be chosen as

    $C\left(Z[R, T \mid \mathcal{G}\{X_t(r,t)\},\, r \in D(R),\, t < T];\; Z[R, T \mid \mathcal{H}\{X_f(r,t)\},\, r \in D(R),\, t < T]\right)$    (A.3)

where $C(\cdot\,;\cdot)$ is some function satisfying the above conditions (a)-(c). The cost function
$C(\cdot\,;\cdot)$ in (A.3) should be a function not only of $\mathcal{G}$ and $\mathcal{M}$, but also of those parameters
that correspond to our best guess for the uncertainties, errors and noise.
Indeed, in most cases, we can never know the real uncertainties, errors and noise in $\mathcal{G}$
and $\mathcal{M}$ (or even $\mathcal{H}$). Hence, we must parameterize them based on our best guess. In
data assimilation (described in the main text in relation to model calibration and
validation), the accuracy of such parameterization is known to influence the results
significantly.

Generalizations of (A.3) allowing for different fields in the two sets of variables in
$C$ are needed for some problems, such as in validation of meteorological models. For
instance, consider a model state vector X (dimension is on the order of 10^) which 
is computed on a fixed spatial grid. In general, the locations of the observations are 
not on the computational grid (for example, consider measurements with weather 
balloons released from the surface). Thus, the observation $Y$ is a function of $X$, but
is not an attempt to estimate $X$ itself. Hence, if the cost function is quadratic, it has
the form $(Y - H(X))^T O^{-1} (Y - H(X))$, where $H$ acts on the interpolation function
to pick up the model variable at the grid points close to the observed location,
and $O$ is related to the error covariance. Let us imagine a validation case using
satellite infrared images for $Y$ and atmospheric radiative state for $X$. Observations
are quasi-uniform in space at a given time; at each time, available observations and
their quality (represented by $O$) may change, however. In this case, the cost function
must take into account the mapping between $X$ and $Y$, so that we have
$C(X, Y) = C(|H(X) - Y|)$ rather than $C(X, Y) = C(|X - Y|)$; it therefore takes the form
$(Y - H(X))^T O^{-1} (Y - H(X))$ when $C$ is quadratic. In addition, for heterogeneous
observations (satellite images, weather balloon measurements, airplane sampling, and so on),
cost functions should take all these data into account, such as

    $C(x, y) = C_{\rm satellite}(x, y) + C_{\rm balloon}(x, y) + C_{\rm airplane}(x, y) + \cdots$

and each $C$ may have a complex idiosyncratic observation function $H$. See Courtier
et al. [122] for a discussion on cost functions for atmospheric models and observation
systems.
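The quadratic form and the sum over observing systems can be made concrete with a small
Python sketch. It rests on simplifying assumptions that are not in the text: a
one-dimensional grid, linear interpolation as the observation function $H$, a diagonal
error covariance $O$ (independent errors), and made-up values for the model state and the
two hypothetical observing systems.

```python
def interpolate(grid_x, grid_values, x_obs):
    """Observation function H: linear interpolation of the gridded
    model state X to an off-grid observation location."""
    for k in range(len(grid_x) - 1):
        x0, x1 = grid_x[k], grid_x[k + 1]
        if x0 <= x_obs <= x1:
            w = (x_obs - x0) / (x1 - x0)
            return (1 - w) * grid_values[k] + w * grid_values[k + 1]
    raise ValueError("observation lies outside the model grid")

def quadratic_cost(grid_x, state, obs_locations, obs_values, obs_variances):
    """(Y - H(X))^T O^{-1} (Y - H(X)) for a diagonal error covariance O."""
    return sum((y - interpolate(grid_x, state, x)) ** 2 / var
               for x, y, var in zip(obs_locations, obs_values, obs_variances))

# Model state X on a fixed grid (hypothetical temperature profile in K
# at heights given in km)
grid_x = [0.0, 1.0, 2.0, 3.0, 4.0]
state = [288.0, 282.0, 275.0, 269.0, 262.0]

# Two hypothetical observing systems with different locations and accuracies
satellite = dict(obs_locations=[0.5, 2.5], obs_values=[285.5, 271.0],
                 obs_variances=[4.0, 4.0])
balloon = dict(obs_locations=[1.2, 3.7], obs_values=[280.4, 264.0],
               obs_variances=[0.25, 0.25])

c_satellite = quadratic_cost(grid_x, state, **satellite)
c_balloon = quadratic_cost(grid_x, state, **balloon)
print(f"C_satellite = {c_satellite:.4f}")
print(f"C_balloon   = {c_balloon:.4f}")
print(f"C_total     = {c_satellite + c_balloon:.4f}")
```

Weighting each squared misfit by the inverse error variance lets accurate instruments
count for more than noisy ones, which is the role played by $O^{-1}$ in the quadratic form.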

Problem 4: Targeting model errors (How to interpret the results?) 

The problem here is to find the "dimensions" of the model that are missing, misrep- 
resented or erroneous. The question of how to interpret the results thus leads to the 
discussion of the missing or misrepresented elements in the model. What tests can 
be used to betray the existence of hidden degrees of freedom and/or dimensions? 

This is the hardest problem of all. It can sometimes find an elegant solution 
when a given model is embedded in a more general model. Then, the limitation 
of the "small" model becomes clear from the vantage of the more general model. 
Well-known examples are 

• Newtonian mechanics as part of special relativity, when $v \ll c$ where $v$ (resp. $c$)
is the velocity of the body (resp. of light);

• classical mechanics as part of quantum mechanics when $h/mc \ll L$ (where $h$ is
Planck's constant, $m$ and $L$ are the mass and size of the body and $h/mc$ is the
associated Compton wavelength);

• Eulerian hydrodynamics as part of Navier-Stokes hydrodynamics with its rich
phenomenology of turbulent motion (when the Reynolds number goes to infinity or,
equivalently, viscosity goes to zero);

• classical thermodynamics as part of statistical physics of $N \gg 1$ particles or
elements, where phase transitions and thermodynamic phases emerge in the
limit $N \to \infty$.

The challenge of targeting model errors is to develop diagnostics of missing dimensions
even in the absence of a more encompassing model. This could be done by adding
random new dimensions to the model and studying its robustness. 

In what sense can one detect that a model is missing some essential ingredient, 
some crucial mechanisms, or that the number of variables or dimensions is inade- 
quate? To use a metaphor, this question is similar to asking ants living and walking 
on a plane to gain awareness that there is a third dimension. 

This question (raised already by the German philosopher Kant) actually has
an answer that was studied and solved by Ehrenfest in 1917 [123] (see also
Whitrow's 1956 article [124]). This answer is based on the analysis of several
fundamental physical laws in $\mathbb{R}^n$ spaces and comparing their predictions as a
function of $n$. The value $n = 3$ turns out to be very special! Thus, ants studying
gravitation or electromagnetic fields will see that there is more to space than
their plane.
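
The flavor of such a dimensional argument can be sketched with a standard scaling relation. This is only a textbook illustration under the assumption that the field of a point source obeys Gauss's law in $\mathbb{R}^n$; it is not a reproduction of the derivations in the cited references.

```latex
% The flux of the field E through a sphere of radius r in R^n is fixed by
% the enclosed source q, and the "area" of that sphere scales as r^{n-1}:
\[
  E(r)\, r^{\,n-1} \propto q
  \quad\Longrightarrow\quad
  E(r) \propto \frac{q}{r^{\,n-1}} .
\]
% n = 2 (a strictly planar world): E ~ 1/r
% n = 3:                           E ~ 1/r^2
```

Planar observers who measure a $1/r^2$ force law rather than $1/r$ could thus infer $n = 3$ from measurements confined to their plane.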

A.3 Fundamental Limits on Model Validation

Before, during, and after engaging in model validation, it is wise to reflect frequently
and systematically on what is not known. Two examples using the formalization
introduced in section A.1 are:

Ignorance on the model 

As quoted in the main text, Roache [2] states, in a nutshell, that validation is about 
solving the right equations for the problem of immediate concern. How do we know 
the right equations? 

Consider, for instance, point vortex models, and let us perform "twin experiments,"
i.e., (1) first generate the "simulated observations" with a "true" point vortex
system that is unknown to the make-believe observer and modeler; (2) use the
procedure of section A.1 and construct a "validated" point vortex system. The problem
is that, even before we start model validation, we are already using one of the most 
critical pieces of information, which is that the system is based on point vortices. 
Similar criticism for the use of "simulated observations" has been raised in data as- 
similation studies using OSSEs (Observing-System Simulation Experiments). This 
criticism is crucial for model validation. 

For this unavoidable issue of model errors, we suggest that one needs a hierarchy 
of investigations: 

1. Look at the statistical or global properties of the time series and/or fields gen- 
erated by the models as well as from the data, such as distributions, correlation 
functions, n-point statistics, fractal and multifractal properties of the attractors 
and emergent structures, in order to characterize how much of the data our 
model fits. Part of this approach is the use of maximum likelihood theory to 
determine the most probable value of the parameters of the model, conditioned 
on the realization of the time series. 

2. We can bring to bear on the problem the modern methods of computational 
intelligence (or machine learning), including pattern classification and recognition
methods ranging from the already classical ones (e.g., neural networks,
k-means) to the most recent advances (e.g., support vector machines, "random
forests").

3. Lastly, a qualification of the model is obtained by testing and quantifying how 
well it predicts the "future" beyond the interval used for calibration/initialization. 
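
Items 1 and 3 of this hierarchy can be sketched on a toy problem. The example below is a minimal illustration, not a method taken from the text: it assumes a first-order autoregressive process with Gaussian noise as a stand-in for both the "data" and the "model," and the function names (`simulate_ar1`, `fit_ar1`, `rmse_one_step`) and parameter values are hypothetical. The parameter is estimated by conditional maximum likelihood on a calibration interval, and the fitted model is then scored on the later, unused part of the series against a naive persistence forecast.

```python
import math
import random

def simulate_ar1(phi, sigma, n, seed=0):
    """Generate n points of x[t+1] = phi*x[t] + sigma*noise (stand-in 'data')."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + sigma * rng.gauss(0.0, 1.0))
    return x

def fit_ar1(x):
    """Conditional maximum-likelihood estimate of phi for Gaussian noise
    (identical to the least-squares estimate)."""
    num = sum(a * b for a, b in zip(x[:-1], x[1:]))
    den = sum(a * a for a in x[:-1])
    return num / den

def rmse_one_step(x, phi):
    """Root-mean-square error of the one-step-ahead predictions phi*x[t]."""
    errs = [(b - phi * a) ** 2 for a, b in zip(x[:-1], x[1:])]
    return math.sqrt(sum(errs) / len(errs))

# Assumed toy setting: true phi = 0.8, unit noise, 4000 samples.
data = simulate_ar1(phi=0.8, sigma=1.0, n=4000, seed=1)
calib, future = data[:2000], data[2000:]   # calibration vs. "future" interval

phi_hat = fit_ar1(calib)                    # calibrated on the first half only
rmse_in = rmse_one_step(calib, phi_hat)     # in-sample error
rmse_out = rmse_one_step(future, phi_hat)   # out-of-sample error
rmse_persist = rmse_one_step(future, 1.0)   # naive baseline: x[t+1] = x[t]

print(f"phi_hat = {phi_hat:.3f}")
print(f"RMSE in-sample = {rmse_in:.3f}, out-of-sample = {rmse_out:.3f}")
print(f"RMSE of persistence baseline = {rmse_persist:.3f}")
```

Comparing the out-of-sample error both with the in-sample error and with a trivial baseline is one simple way to quantify how well a calibrated model predicts beyond its calibration interval.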

Levels of ignorance on the observation Q 

• First level: The characteristics of the noise are known, such as its distribution, 
covariance, and maybe higher-order statistics. 

• Second level: It may happen that the statistical properties of the noise are poorly 
known or constrained. 

• Third level: A worse situation is when some noise components are not known to 
exist and are thus simply not considered in the treatment. For instance, imagine 
that one forgets in climate modeling about the impact of biological variability 
in time and space in the distribution of CO2 sequestration sites. 

• Fourth level: Finally, there is the representation error in Q itself, i.e., how Q is
modeled mathematically in H.

Consequences of the sensitivity to initial conditions and
nonlinearity in the model

Even an accurate forecast is limited by the inherent predictability of the system.
In the same way, validation may be hindered by limited access to testing. The
predictability of a system refers to the fundamental limits of prediction for a system. For
instance, if a system is pure noise, there is no possibility of forecasting it better than 
chance. Similarly, there may be limits in the possibilities of testing the performance 
of a model because of limits in measurements, limits in access to key parameters for 
instance. With such limitations, it may be impossible to fully validate a model. 

A well-known source that limits predictability is the property of sensitivity to 
initial conditions, which is one of the ingredients leading to chaotic behavior. Vali- 
dation has to be made immune to this sensitivity upon initial conditions, by using a 
variety of methods, including the properties of attractors, their invariant measures, 
the properties of Lyapunov exponents, and so on. Pisarenko and Sornette [125] have
shown that the sensitivity upon initial conditions leads to a limit of testability in 
simple toy models of chaotic dynamical systems, such as the logistic map. They 
addressed the possibility of applying standard statistical methods (the least-squares
method, the maximum likelihood estimation method, the method of statistical
moments for estimation of parameters) to deterministically chaotic low-dimensional
dynamical systems containing additive dynamical noise. First, the nature of the
system is found to require that any statistical method of estimation combines the 
estimation of the structural parameter with the estimation of the initial value. This 
is potentially an important lesson for such a class of systems. In addition, in such 
systems, one needs a trade-off between the need of using a large number of data 
points in the statistical estimation method to decrease the bias (i.e., to guarantee 
the consistency of the estimation) and the unstable nature of dynamical trajectories 
with exponentially fast loss of memory of the initial condition. In this simple exam- 
ple, the limit of testability is reflected in the absence of theorems on the consistency 
and efficiency of maximum likelihood estimation (MLE) methods [125]. MLE
sometimes gives good practical results in controlled situations for which past
experience has been accumulated, but there is no guarantee that the MLE will not
go astray in some cases.
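
The contrast between trajectory-level sensitivity and the robustness of invariant quantities can be illustrated with the logistic map. The sketch below is a minimal example under assumptions of our own choosing (the fully chaotic parameter value r = 4, an initial offset of 1e-10, and hypothetical function names); it does not reproduce the estimation procedures analyzed in [125]. Two trajectories started almost identically separate to order one within a few dozen iterations, while the Lyapunov exponent estimated along a long orbit stays close to its known value of ln 2 for r = 4.

```python
import math

def logistic(x, r=4.0):
    """One iteration of the logistic map x -> r*x*(1 - x)."""
    return r * x * (1.0 - x)

def trajectory(x0, n, r=4.0):
    """Return the list [x0, x1, ..., xn] generated by the logistic map."""
    xs = [x0]
    for _ in range(n):
        xs.append(logistic(xs[-1], r))
    return xs

def lyapunov(x0, n, r=4.0, burn_in=100):
    """Estimate the Lyapunov exponent as the orbit average of log|f'(x)|,
    with f'(x) = r*(1 - 2x), after discarding a short transient."""
    x = x0
    for _ in range(burn_in):
        x = logistic(x, r)
    total = 0.0
    for _ in range(n):
        total += math.log(abs(r * (1.0 - 2.0 * x)))
        x = logistic(x, r)
    return total / n

# Two initial conditions differing by 1e-10 (assumed illustrative values).
a = trajectory(0.3, 60)
b = trajectory(0.3 + 1e-10, 60)
gap = [abs(u - v) for u, v in zip(a, b)]

# First iteration at which the two trajectories differ by more than 0.1.
first_large = next((i for i, g in enumerate(gap) if g > 0.1), None)

# An invariant of the dynamics, insensitive to the exact initial condition.
lam = lyapunov(0.3, 100_000)

print("trajectories separate by > 0.1 at step", first_large)
print("Lyapunov exponent estimate:", lam, " ln 2 =", math.log(2.0))
```

A point-by-point comparison of model and data trajectories therefore loses meaning after a few dozen steps in this toy setting, whereas a comparison based on an invariant such as the Lyapunov exponent does not depend on knowing the initial condition to ten digits.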

This work has also shown that the Bayesian approach to parameter estimation of 
chaotic deterministic systems is incorrect and probably suboptimal. The Bayesian 
approach usually assumes non-informative priors for the structural parameters of 
the model, for the initial value and for the standard deviation of the noise. This ap- 
proach turns out to be incorrect, because it amounts to assuming a stochastic model, 
thus referring to quite another problem, since the correct model is fundamentally 
deterministic (only with the addition of some noise). 

This negative conclusion on the use of the Bayesian approach should be contrasted
with the Bayesian approach of Hanson and Hemez [126] to model the plastic-flow
characteristics of a high-strength steel by combining data from basic material
tests. The use of a Bayesian approach to this latter problem seems warranted because
the priors reflect the intrinsic heterogeneity of the samples and the large dispersion 
of the experiments. In this particular problem concerning material properties, the 
use of Bayesian priors is warranted by the fact that the structural parameters of the 
model can be viewed as drawn from a population. It is very important to stress this 
point: Bayesian approaches to structural parameter determination are justified only
in problems with random distributions of the parameters. For the previous problem
of deterministic nonlinear dynamics, it turns out to be fundamentally incorrect. We 
therefore view proper partition of the problem at hand between deterministic and 
random components as an essential part of validation. 

Extrapolating beyond the range of available data 

In the previous discussion, the limit of testability is solely due to the phenomenon of 
sensitive dependence upon initial conditions, as the model is assumed to be known 
(the logistic map in the above example). In general, we do not have such luxury. Let 
us illustrate the much more difficult problem by two examples stressing the possibility
of the existence of "indistinguishable states." Consider a map $f_1$ that generates
a time series. Assuming that $f_2$ is unknown a priori, let us construct/constrain the
map $f_2$ whose initial condition and parameters can be tuned in such a way that
trajectories of $f_2$ can follow data of $f_1$ for a while, but eventually the two maps diverge.
Suppose that the time series of $f_1$ is too short to explore the range expressing the
divergence between the two maps. How can we (in)validate $f_2$ as an incorrect model
of $f_1$?

This problem arises in the characterization of the tail of distributions of stochastic
variables. For instance, Malevergne, Pisarenko and Sornette [127] have shown
that, based on available data, the best tests and efforts cannot distinguish between
a power law tail and a stretched exponential distribution for financial returns. The
two classes of models are indistinguishable, given the amount of data. This fundamental
limitation unfortunately has severe consequences, because choosing one
model or the other implies different predictions for the frequency of very large
losses that lie beyond the range sampled by historical data (the $f_1$-$f_2$ problem).
The practical consequences are significant, in terms of the billions of dollars banks 
should put (or not) aside to cover large market swings that are outside the data set 
available from the known past history. 

This example illustrates a crucial aspect of model validation, namely that it 
requires the issuance of predictions outside the domain of parameters and/or of 
variables that has been tested "in-sample" to establish the (calibrated or "tuned") 
model itself. 
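
The danger of extrapolating beyond the sampled range can be made concrete with a deterministic toy calculation. The sketch below is illustrative only: the threshold, the largest "observed" value, and the stretched-exponential parameters are assumed values, and the calculation is not the statistical test of [127]. A power-law (Pareto) tail is calibrated to agree with a stretched-exponential tail at both ends of the "observed" range; the two survival functions then stay within a few percent of each other in-sample, yet differ by roughly an order of magnitude at ten times the largest observed value.

```python
import math

U = 1.0            # tail threshold (assumed)
X_MAX = 3.0        # largest value in the "observed" range (assumed)
C, D = 0.3, 0.01   # stretched-exponential shape and scale (assumed)

def surv_se(x):
    """Survival function of a stretched exponential, conditional on x >= U."""
    return math.exp(-((x / D) ** C - (U / D) ** C))

# Calibrate the Pareto exponent so both tails agree at x = U and x = X_MAX.
B = -math.log(surv_se(X_MAX)) / math.log(X_MAX / U)

def surv_pareto(x):
    """Survival function of a Pareto (power-law) tail, conditional on x >= U."""
    return (U / x) ** B

def rel_gap(x):
    """Relative difference between the two survival functions at x."""
    return abs(surv_pareto(x) - surv_se(x)) / surv_se(x)

# In-sample comparison on a grid covering the observed range [U, X_MAX].
grid = [U + (X_MAX - U) * k / 200 for k in range(201)]
max_in_sample_gap = max(rel_gap(x) for x in grid)

# Out-of-sample comparison at ten times the largest observed value.
ratio_at_30 = surv_pareto(30.0) / surv_se(30.0)

print(f"calibrated Pareto exponent: {B:.3f}")
print(f"largest in-sample relative gap: {max_in_sample_gap:.3f}")
print(f"Pareto / stretched-exponential tail probability at x = 30: {ratio_at_30:.1f}")
```

With noisy data confined to the observed range, a discrepancy of a few percent is easily masked by sampling fluctuations, whereas the implied frequencies of extreme events differ substantially.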